ResearchObject / workflow-run-crate

Workflow Run RO-Crate profile
https://www.researchobject.org/workflow-run-crate/
Apache License 2.0

Added initial WfExS-backend examples based on toy workflow. #53

Closed jmfernandez closed 8 months ago

jmfernandez commented 1 year ago

I have just added the generated RO-Crates from the execution of two toy workflows with WfExS-backend.

simleo commented 1 year ago

The workflow run crates will have to conform to the 0.2 version of the profile (via conformsTo), which is to be released soon. See https://github.com/ResearchObject/workflow-run-crate/pull/55.

jmfernandez commented 10 months ago

I have been updating the previous examples, as well as adding new, real-life ones. New examples in this repo only include the generated ro-crate-metadata.json and either a copy of the workflow or its packed version, as the inputs and containers used take up several GB.

simleo commented 10 months ago

Thanks for the updates, José María. Looking at the previous examples, progress has been made. However, some issues remain, and the new workflows bring some other issues. I'll list what I've found below.

All (or most) crates

"@context": [
    "https://w3id.org/ro/crate/1.1/context",
    "https://w3id.org/ro/terms/workflow-run"
]

cosifer-cwl_staged

cosifer-nxf_provenance

cosifer-nxf_staged

wombat-pipelines_provenance

nfcore-rnaseq_provenance

Same issues as wombat-pipelines_provenance, plus:

Wetlab2Variations_CWL_provenance

jmfernandez commented 9 months ago

I have revised the main issues. I still have to revise the other ones, and figure out the best way to add, and semantically relate to the generated RO-Crates, a README file describing the meaning and usage of the different schemes.

simleo commented 9 months ago

The biggest issue with the recent changes is that CreateAction instances have instrument properties with multiple values. CreateAction instances need to have a single instrument, pointing to the application used to perform the action. In particular, the action that represents the workflow run must have the workflow itself as its instrument. This is crucial to reconstruct the mapping between the action's actual values and the formal parameters. If you wish to list dependencies somehow, it's better to include them in the application's SoftwareRequirements as you have been doing all along. Multiple values for instrument break runcrate report BTW.
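A minimal sketch of how the single-instrument rule could be checked on a crate's `@graph`, assuming the metadata has already been loaded as a list of dicts. The function name and the sample entities are hypothetical and only illustrate the rule; this is not how runcrate itself performs the check.

```python
def actions_with_multiple_instruments(graph):
    """Return the @id of every CreateAction whose instrument has more than one value."""
    offenders = []
    for entity in graph:
        types = entity.get("@type", [])
        if isinstance(types, str):
            types = [types]
        if "CreateAction" not in types:
            continue
        instrument = entity.get("instrument")
        # In JSON-LD a single value is usually a dict; several values form a list.
        if isinstance(instrument, list) and len(instrument) > 1:
            offenders.append(entity["@id"])
    return offenders


# Hypothetical graph: one well-formed action and one with two instruments.
sample_graph = [
    {"@id": "#run-ok", "@type": "CreateAction",
     "instrument": {"@id": "workflow.cwl"}},
    {"@id": "#run-bad", "@type": "CreateAction",
     "instrument": [{"@id": "workflow.cwl"}, {"@id": "#container-engine"}]},
    {"@id": "workflow.cwl",
     "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"]},
]

print(actions_with_multiple_instruments(sample_graph))
```

Only the action whose `instrument` is a two-element list is reported; the workflow entity is skipped because it is not a CreateAction.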

Other issues I've found with the latest changes:

jmfernandez commented 9 months ago

> The biggest issue with the recent changes is that CreateAction instances have instrument properties with multiple values. CreateAction instances need to have a single instrument, pointing to the application used to perform the action. In particular, the action that represents the workflow run must have the workflow itself as its instrument. This is crucial to reconstruct the mapping between the action's actual values and the formal parameters. If you wish to list dependencies somehow, it's better to include them in the application's SoftwareRequirements as you have been doing all along. Multiple values for instrument break runcrate report BTW.

Understood; I have applied the change. The reason I added multiple instruments to the CreateAction is that a workflow can be run in different containerization modes, so fully describing the workflow execution (as opposed to the workflow itself) requires one of the following: adding all the software requirements under instrument; declaring all the instruments together in a collection and putting that collection under instrument; or declaring them somewhere else (for example under softwareAddOn, although that is not completely correct). For instance, an nf-core workflow can be run in conda mode (so the dependencies should be linked somewhere) or in singularity mode (so the containers used should be pointed out), and there are other tangential execution details, such as a queue system.

So, my conclusion is that the containers and the container engine should not appear under softwareRequirements of the computational workflow as such, since they might not be used in a different execution. In the meantime, I'm declaring them under softwareAddOn.

So, in the long term, I think we should distinguish between the workflow as SoftwareSourceCode (the targetPlatform is the workflow engine) and the instantiated workflow as SoftwareApplication (the softwareRequirements are the workflow engine, the containers, etc...). What do you think?

> Other issues I've found with the latest changes:
>
> * The root dataset should not have an `"about": {"@id": "README.md"}`; it's the readme that is about the crate, not the other way around, so the "README.md" entity should have an `"about": {"@id": "./"}`.

Thanks! I didn't realize I put the relation in the wrong way.
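The direction of the relation can be flipped mechanically. The following is a minimal sketch, assuming the `@graph` is available as a list of dicts and that the root dataset and the readme use the ids `./` and `README.md`; `fix_readme_about` is a hypothetical helper, not part of WfExS-backend or ro-crate-py.

```python
def fix_readme_about(graph, readme_id="README.md", root_id="./"):
    """Make the readme point at the root dataset via `about`, not the reverse."""
    by_id = {entity["@id"]: entity for entity in graph}
    root = by_id[root_id]
    readme = by_id[readme_id]
    # Drop the inverted relation only if it is exactly the readme reference.
    if root.get("about") == {"@id": readme_id}:
        del root["about"]
    readme["about"] = {"@id": root_id}
    return graph


# Hypothetical fragment reproducing the inverted relation.
sample_graph = [
    {"@id": "./", "@type": "Dataset", "about": {"@id": "README.md"}},
    {"@id": "README.md", "@type": "File"},
]

fixed = fix_readme_about(sample_graph)
print(fixed)
```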

> * The entities used to represent Docker images link to a JSON file via `hasPart`. I think this should be `subjectOf` instead, though it's redundant since the entities that represent the JSON files have an `about` pointing to the Docker image entities. By the way, what are these JSON files? Do they conform to some standard for describing images / containers?

These JSON files hold metadata gathered by WfExS-backend; their format depends on whether Singularity/Apptainer or Docker/Podman was used. They help WfExS-backend detect when the original container and the contents from the cache (or the RO-Crate) do not match. I have just added a couple of paragraphs to the automatically included README.md files, giving some details.

To answer your last questions: when docker or podman mode has been used to run the workflow, the metadata of the images can be obtained with `docker inspect imagetag` or `podman inspect imagetag`. This metadata is an array describing each layer of the image, and it is preserved under the manifests key. The other keys keep digests and an id which is stable across docker save + docker load operations.
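A rough sketch of collecting that metadata from Python, assuming `docker` (or `podman`) is on the PATH. The helper names are hypothetical, the sample record is made up (its `Id` is a placeholder), and the resulting dictionary is not the exact layout WfExS-backend writes. `Id`, `RepoDigests`, `Architecture` and `Os` are fields of the image inspect output.

```python
import json
import subprocess


def inspect_image(image_tag, tool="docker"):
    """Run `<tool> inspect <image_tag>` and return the parsed JSON array."""
    result = subprocess.run(
        [tool, "inspect", image_tag],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def summarize(inspect_output):
    """Keep the id, digests and platform of each record in the inspect array."""
    return [
        {
            "id": record.get("Id"),
            "repo_digests": record.get("RepoDigests", []),
            "architecture": record.get("Architecture"),
            "os": record.get("Os"),
        }
        for record in inspect_output
    ]


# Made-up record shaped like inspect output, so the sketch runs without docker.
SAMPLE = [{
    "Id": "sha256:" + "0" * 64,  # placeholder, not a real image id
    "RepoTags": ["node:slim"],
    "RepoDigests": ["node@sha256:dc1906714d1993d291e1e7b5f236291236b0a0b6dfacdb164e4a9ea44d09c52e"],
    "Architecture": "amd64",
    "Os": "linux",
}]

summary = summarize(SAMPLE)
print(json.dumps(summary, indent=2))
```

On a machine with the image available, `summarize(inspect_image("node:slim"))` would produce the same kind of record from the real output.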

When Singularity/Apptainer mode is used for the workflow execution, container images can come either from HTTP requests or from Docker registries. Both kinds are materialized by singularity pull, but that command does not preserve the metadata from Docker registries, such as the original id and the original layers of the container. So, for Singularity images created from Docker ones, the code queries the registry through its REST API to get the original container image metadata.
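As an illustration of the kind of request involved, the sketch below builds a manifest request following the Docker Registry HTTP API V2 layout (`/v2/<name>/manifests/<reference>`). It only constructs the URL and headers. Authentication (Docker Hub, for instance, expects a bearer token) is left out, and `manifest_request` is a hypothetical helper, not the code WfExS-backend uses.

```python
def manifest_request(registry_host, repository, reference):
    """Return (url, headers) for fetching an image manifest from a V2 registry."""
    url = f"https://{registry_host}/v2/{repository}/manifests/{reference}"
    headers = {
        # Ask for the schema 2 manifest, which lists the config and the layers.
        "Accept": "application/vnd.docker.distribution.manifest.v2+json",
    }
    return url, headers


# Docker Hub serves its registry API from registry-1.docker.io, and official
# images live under the "library/" namespace.
url, headers = manifest_request("registry-1.docker.io", "library/node", "slim")
print(url)
```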

So, although part of the content of those JSON files is original metadata, it is augmented with additional details.

simleo commented 9 months ago

I think distinguishing between "just code" and actual running application would be very hard at this point, since the whole model (and the tooling) is based on code/application being on the prospective part and actions on the retrospective part. Even at the lowest level of Process Run Crate it says that the application's type should include SoftwareApplication, SoftwareSourceCode or ComputationalWorkflow. The main problem with softwareAddOn is that it's still on the prospective side (in SoftwareApplication). We need a way to associate the container image with the action. Can you try using the current ContainerImage proposal? For instance, for cosifer-cwl_provenance it should look like:

{
    "@id": "#e0d55b35-b042-420e-8cf3-c8424644f17b",
    "@type": "CreateAction",
    "containerImage": "#cosifer-image"
},
{
    "@id": "#cosifer-image",
    "@type": "ContainerImage",
    "additionalType": "DockerImage",
    "registry": "docker.io",
    "name": "tsenit/cosifer",
    "tag": "b4d5af45d2fc54b6bff2a9153a8e9054e560302e"
}

We could add URL to the range of containerImage, so you can also do:

{
    "@id": "#e0d55b35-b042-420e-8cf3-c8424644f17b",
    "@type": "CreateAction",
    "containerImage": "docker://tsenit/cosifer:b4d5af45d2fc54b6bff2a9153a8e9054e560302e"
}
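The two forms carry the same information, so one can be derived from the other. Below is a minimal sketch that turns a `docker://` URL into a ContainerImage entity like the one shown earlier. The parsing rules are simplifying assumptions (no digest references, `docker.io` as default registry, `latest` as default tag), and `container_image_entity` is a hypothetical helper.

```python
def container_image_entity(url, entity_id, default_registry="docker.io"):
    """Build a ContainerImage entity from a docker:// URL (tag references only)."""
    prefix = "docker://"
    if not url.startswith(prefix):
        raise ValueError(f"not a docker:// URL: {url}")
    ref = url[len(prefix):]
    if "@" in ref:
        raise ValueError("digest references are not handled in this sketch")
    # The tag, if present, follows the last ':' of the last path component.
    if ":" in ref.rsplit("/", 1)[-1]:
        ref, tag = ref.rsplit(":", 1)
    else:
        tag = "latest"
    # Treat the first component as a registry only if it looks like a host.
    first, _, rest = ref.partition("/")
    if rest and ("." in first or ":" in first or first == "localhost"):
        registry, name = first, rest
    else:
        registry, name = default_registry, ref
    return {
        "@id": entity_id,
        "@type": "ContainerImage",
        "additionalType": "DockerImage",
        "registry": registry,
        "name": name,
        "tag": tag,
    }


entity = container_image_entity(
    "docker://tsenit/cosifer:b4d5af45d2fc54b6bff2a9153a8e9054e560302e",
    "#cosifer-image",
)
print(entity)
```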

Regarding the JSON files about the container images, the container images should not refer to them via hasPart.

jmfernandez commented 8 months ago

I have updated all the examples, so they are now using ContainerImage. They should also reflect the cases where the original source of the container is a Docker registry but the tool used to materialize it was Singularity/Apptainer.

simleo commented 8 months ago

Looking good for the most part. In Wetlab2Variations_CWL_provenance, ContainerImage entities have @ids that are neither URIs nor strings starting with #. This should not happen since these are contextual entities. For some reason it seems that the naming scheme used in other crates is not applied here.
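A quick way to spot such identifiers is sketched below, assuming the `@graph` is a list of dicts. The check is a heuristic: it accepts anything that starts with `#` or with a URI scheme followed by `:`, so a bare `name:tag` string without a slash would slip through. The sample ids are hypothetical.

```python
import re

# A URI scheme: a letter followed by letters, digits, '+', '.' or '-', then ':'.
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")


def bad_container_image_ids(graph):
    """Return ContainerImage @ids that neither start with '#' nor look like a URI."""
    bad = []
    for entity in graph:
        types = entity.get("@type", [])
        if isinstance(types, str):
            types = [types]
        if "ContainerImage" not in types:
            continue
        ident = entity["@id"]
        if not (ident.startswith("#") or SCHEME.match(ident)):
            bad.append(ident)
    return bad


# Hypothetical entities: the first two ids are acceptable, the third is not.
sample_graph = [
    {"@id": "#cosifer-image", "@type": "ContainerImage"},
    {"@id": "docker://docker.io/node:slim",
     "@type": ["ContainerImage", "SoftwareApplication"]},
    {"@id": "quay.io/some/image:1.0", "@type": "ContainerImage"},
]

print(bad_container_image_ids(sample_graph))
```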

Other remarks (for all crates):

{
    "@id": "docker://docker.io/node:slim",
    "@type": [
        "ContainerImage",
        "SoftwareApplication"
    ],
    "additionalType": "DockerImage",
    "applicationCategory": "https://www.wikidata.org/wiki/Q51294208",
    "name": "docker.io/node",
    "operatingSystem": "linux",
    "processorRequirements": "amd64",
    "registry": "docker.io",
    "softwareRequirements": {
        "@id": "https://apptainer.org/"
    },
    "softwareVersion": "library/node@sha256:dc1906714d1993d291e1e7b5f236291236b0a0b6dfacdb164e4a9ea44d09c52e",
    "tag": "slim",
    "sha256": "dc1906714d1993d291e1e7b5f236291236b0a0b6dfacdb164e4a9ea44d09c52e"
}

simleo commented 8 months ago

Version 0.3 of the profiles, which include the specs on container images, has been released, so you can change the conformsTo entries accordingly.

ro-crate-py 0.9.0 has also been released and includes https://github.com/ResearchObject/ro-crate-py/pull/162, so you can use it to add the whole workflow-run context.
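Updating the conformsTo entries amounts to rewriting the profile identifiers. The sketch below does that on plain JSON-LD dicts, assuming the profile ids follow the `https://w3id.org/ro/wfrun/<profile>/<version>` pattern used by the process, workflow and provenance profiles. `bump_profile_version` is a hypothetical helper and does not use the ro-crate-py API.

```python
import re

PROFILE_ID = re.compile(
    r"^(https://w3id\.org/ro/wfrun/(?:process|workflow|provenance)/)\d+(?:\.\d+)*$"
)


def bump_profile_version(node, new_version="0.3"):
    """Return a copy of a JSON-LD structure with profile ids moved to new_version."""
    if isinstance(node, list):
        return [bump_profile_version(item, new_version) for item in node]
    if isinstance(node, dict):
        out = {key: bump_profile_version(value, new_version)
               for key, value in node.items()}
        ident = out.get("@id")
        match = PROFILE_ID.match(ident) if isinstance(ident, str) else None
        if match:
            out["@id"] = match.group(1) + new_version
            # Keep the profile entity's own version property in step, if present.
            if "version" in out:
                out["version"] = new_version
        return out
    return node


# Hypothetical fragment: a root dataset and one profile entity at 0.2.
sample_graph = [
    {"@id": "./", "@type": "Dataset", "conformsTo": [
        {"@id": "https://w3id.org/ro/wfrun/process/0.2"},
        {"@id": "https://w3id.org/workflowhub/workflow-ro-crate/1.0"},
    ]},
    {"@id": "https://w3id.org/ro/wfrun/process/0.2",
     "@type": "CreativeWork", "version": "0.2"},
]

updated = bump_profile_version(sample_graph)
print(updated)
```

Identifiers that do not match the assumed pattern, such as the Workflow RO-Crate one, are left untouched.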

jmfernandez commented 8 months ago

Nice! I'll be doing it over the next few days!

simleo commented 8 months ago

Merging to give a home to the examples. Remaining updates can be done in a subsequent PR