Added initial WfExS-backend examples based on toy workflow.

jmfernandez commented 1 year ago

I have just added the generated RO-Crates from the execution of two toy workflows with WfExS-backend.

simleo commented 1 year ago

The workflow run crates will have to conformTo the 0.2 version of the profile, to be released soon -- see https://github.com/ResearchObject/workflow-run-crate/pull/55.

jmfernandez commented 10 months ago

I have been updating previous examples, as well as adding new, real life, ones. New examples in this repo only include the generated ro-crate-metadata.json and either a copy of the workflow or its packed version, as the used inputs and containers need several GBs.

simleo commented 10 months ago

Thanks for the updates, José María. Looking at the previous examples, progress has been made. However, some issues remain, and the new workflows bring some other issues. I'll list what I've found below.

All (or most) crates

Crates needs to specify a license. We've also been discussing this aspect with the Sapporo team here.
Some lists still have duplicates. For instance, the workflow's input and output in cosifer-nxf_provenance.
softwareRequirements is a property of SoftwareApplication, so SoftwareApplication should be added to the workflow's type list when softwareRequirements is used on a workflow.
The additionalType for entities representing CWL string parameters (e.g., for separator) needs to be Text, not String (String is not a term in the RO-Crate context).
Consolidation actions have WfExS-backend as their agent. The agent of an action should be a Person or Organization, not a SoftwareApplication. I guess that here there is an attempt to represent the fact that cwltool was launched by WfExS-backend, but elsewhere in the crate this kind of relationship is modeled with softwareRequirements. So I would remove the agent entry, change the instrument of the action to WfExS-backend and list cwltool as a software requirement (this was discussed at the meeting yesterday).
docker:// URIs are used as contextual entity @ids in several places. I don't think this is forbidden by the RO-Crate spec, but the examples you find there are usually http(s) and mailto. If less common schemes are used, I think the crate should include a README to help the user understand what they are about and how to interact with them.
The sha256 term is not in the https://w3id.org/ro/crate/1.1/context, but we have it in the workflow-run ro-terms context. To bring in the definition, change the @context entry as follows:

"@context": [
    "https://w3id.org/ro/crate/1.1/context",
    "https://w3id.org/ro/terms/workflow-run"
]

`cosifer-cwl_staged`

Since the only action is the consolidation, which is not a workflow execution, this crate should not conformsTo Workflow Run Crate (https://w3id.org/ro/wfrun/workflow/0.2).
The root dataset should to link to the CreateAction instance via mentions.
The crate contains "dangling" PropertyValue entities, i.e., not listed in any action's object. The fact that they don't appear in the action's object is normal since what's described is not the workflow run but its consolidation. Those PropertyValue entries should be removed. The same goes for the "inputs/data_matrix.csv" File: it should be removed since it's not involved in the described process. Is there any other reason to include them? Does WfExS use such "staged" RO-Crates to run the workflow, reading parameter settings from them? If so, this should be explained in the README and in the paper.

`cosifer-nxf_provenance`

The workflow lists nextflow.config in hasPart, but hasPart should only list the tools orchestrated by the workflow. nextflow.config should be listed in the action's "object" (see referencing configuration files). hasPart in the workflow can be omitted altogether.

`cosifer-nxf_staged`

This crate does not have a CreateAction, so it should not conformsTo https://w3id.org/ro/wfrun/process/0.2 nor to https://w3id.org/ro/wfrun/workflow/0.2. It's a plain Workflow RO-Crate.
Same consideration as cosifer-cwl_staged for dangling entities.

`wombat-pipelines_provenance`

Some of the entities listed in the workflow's hasPart seem to be configuration files rather than tools, see note above for cosifer-nxf_provenance.
Some of the action's object point to Intangible entities, which we are not part of the model. As discussed at the meeting yesterday, this was an attempt to represent a null value. We agreed to add "valueRequired": "False" in the FormalParameter and omit the value from the object listing instead. Ideally, however, we should be able to explicitly state that a defaultValue is null in the FormalParameter and to express that a value was set to null in a PropertyValue for languages that allow this, but how? @stain can Javascript null be used directly in RO-Crate (e.g., "value": null)?
There are some type mismatches between FormalParameters and corresponding PropertyValues. For instance, workflow/main.nf#param:run_statistics has an additionalType of Integer but the corresponding PropertyValue has a boolean value.

`nfcore-rnaseq_provenance`

Same issues as wombat-pipelines_provenance, plus:

This crate uses git+https:// and s3:// schemes for data entities. What's the consumer supposed to do with them? Considerations are similar to those made for docker:// above, but these are data entities, so the consumer should know how to retrieve them.
There's a Collection that has an Intangible as its mainEntity. The mainEntity should be an entry point to a composite dataset (e.g. an index file), as explained in Representing multi-file objects. If such a "special" file exists in the collection it should be listed both as the mainEntity and under hasPart.

`Wetlab2Variations_CWL_provenance`

This crate uses trs:// schemes for data entities. Same consideration as those made for git+https:// etc.
The #e0b54cc1-05bb-46cd-9f62-b44293f4c053 collection has a mainEntity that's not listed in the collection's hasPart

jmfernandez commented 9 months ago

I have revised the main issues. I still have to revise the other ones, and figure out the best way to add and semantically relate a README file describing the meaning and usage of the different schemes to the generated RO-Crates

simleo commented 9 months ago

The biggest issue with the recent changes is that CreateAction instances have instrument properties with multiple values. CreateAction instances need to have a single instrument, pointing to the application used to perform the action. In particular, the action that represents the workflow run must have the workflow itself as its instrument. This is crucial to reconstruct the mapping between the action's actual values and the formal parameters. If you wish to list dependencies somehow, it's better to include them in the application's SoftwareRequirements as you have been doing all along. Multiple values for instrument break runcrate report BTW.

Other issues I've found with the latest changes:

The root dataset should not have an "about": {"@id": "README.md"}; it's the readme that is about the crate, not the other way around, so the "README.md" entity should have an "about": {"@id": "./"}.
The entities used to represent Docker images link to a JSON file via hasPart. I think this should be subjectOf instead, though it's redundant since the entities that represent the JSON files have an about pointing to the Docker image entities. By the way, what are these JSON files? Do they conform to some standard for describing images / containers?

jmfernandez commented 9 months ago

The biggest issue with the recent changes is that CreateAction instances have instrument properties with multiple values. CreateAction instances need to have a single instrument, pointing to the application used to perform the action. In particular, the action that represents the workflow run must have the workflow itself as its instrument. This is crucial to reconstruct the mapping between the action's actual values and the formal parameters. If you wish to list dependencies somehow, it's better to include them in the application's SoftwareRequirements as you have been doing all along. Multiple values for instrument break runcrate report BTW.

I understand it, I have applied the change. The reason I added multiple instruments to the CreateAction is that I realized a workflow can be run in different containerization modes, so fully describing the workflow execution (not the workflow itself) requires either adding all the software requirements under instrument, declaring all the instruments together in a collection and adding the collection under instrument or declaring them in a different place (although not completely correct, under softwareAddOn). For instance, an nf-core workflow can be run either using conda mode (so the dependencies should be linked in some place) or singularity (the used containers should be pointed out), plus other tangential software execution details, like a queue system.

So, my conclusion is that the containers and the container engine should not appear under softwareRequirements of the computational workflow as such, as they could not appear in a different execution. So, in the mean time, I'm declaring them under softwareAddOn.

So, in the long term, I think we should distinguish between the workflow as SoftwareSourceCode (the targetPlatform is the workflow engine) and the instantiated workflow as SoftwareApplication (the softwareRequirements are the workflow engine, the containers, etc...). What do you think?

Other issues I've found with the latest changes:

* The root dataset should not have an `"about": {"@id": "README.md"}`; it's the readme that is about the crate, not the other way around, so the "README.md" entity should have an `"about": {"@id": "./"}`.

Thanks! I didn't realize I put the relation in the wrong way.

* The entities used to represent Docker images link to a JSON file via `hasPart`. I think this should be `subjectOf` instead, though it's redundant since the entities that represent the JSON files have an `about` pointing to the Docker image entities. By the way, what are these JSON files? Do they conform to some standard for describing images / containers?

These JSON are metadata gathered by WfExS-backend, and depending on whether Singularity/Apptainer or Docker/Podman have one or another format. They help WfExS-backend to identify when the original container and the contents from the cache (or RO-Crate) do not match. I have just added a couple of paragraphs to the automatically included README.md files, giving some details.

Following with your last questions, when docker or podman modes have been used to run the workflow, the metadata of the images can be obtained using docker inspect imagetag or podman inspect imagetag. This metadata is an array describing each layer of the image, and is preserved under the manifests key. The other keys keep digests and an id which is stable with docker save + docker load operations.

When Singularity/Apptainer mode is used for the workflow execution, container images can come from either http requests or from docker registries. Any of them are materialized by singularity pull, but that command does not preserve the metadata from docker registries, like the original id and original layers of the container. So, for singularity images created from docker ones the code asks through REST API to get the original container image metadata from the registry.

So, although part from those JSON files contain original metadata, they are augmented with additional details.

simleo commented 9 months ago

I think distinguishing between "just code" and actual running application would be very hard at this point, since the whole model (and the tooling) is based on code/application being on the prospective part and actions on the retrospective part. Even at the lowest level of Process Run Crate it says that the application's type should include SoftwareApplication, SoftwareSourceCode or ComputationalWorkflow. The main problem with softwareAddOn is that it's still on the prospective side (in SoftwareApplication). We need a way to associate the container image with the action. Can you try using the current ContainerImage proposal? For instance, for cosifer-cwl_provenance it should look like:

{
    "@id": "#e0d55b35-b042-420e-8cf3-c8424644f17b",
    "@type": "CreateAction",
    "containerImage": "#cosifer-image"
},
{
    "@id": "#cosifer-image",
    "@type": "ContainerImage",
    "additionalType": "DockerImage",
    "registry": "docker.io",
    "name": "tsenit/cosifer",
    "tag": "b4d5af45d2fc54b6bff2a9153a8e9054e560302e"
}

We could add URL to the range of containerImage, so you can also do:

{
    "@id": "#e0d55b35-b042-420e-8cf3-c8424644f17b",
    "@type": "CreateAction",
    "containerImage": "docker://tsenit/cosifer:b4d5af45d2fc54b6bff2a9153a8e9054e560302e"
}

Regarding the JSON files about the container images, the container images should not refer to them via hasPart.

jmfernandez commented 8 months ago

I have updated all the examples, so they are now using ContainerImage. Also, they should be reflecting the cases where the original source of the container is a Docker registry, but the tool used to materialize it was singularity/apptainer

simleo commented 8 months ago

Looking good for the most part. In Wetlab2Variations_CWL_provenance, ContainerImage entities have @ids that are neither URIs nor strings starting with #. This should not happen since these are contextual entities. For some reason it seems that the naming scheme used in other crates is not applied here.

Other remarks (for all crates):

I would remove softwareAddOn and put everything under softwareRequirements
Since you seem to have the hashes for the container images, you should list them using the sha256 property. E.g.:

{
    "@id": "docker://docker.io/node:slim",
    "@type": [
        "ContainerImage",
        "SoftwareApplication"
    ],
    "additionalType": "DockerImage",
    "applicationCategory": "https://www.wikidata.org/wiki/Q51294208",
    "name": "docker.io/node",
    "operatingSystem": "linux",
    "processorRequirements": "amd64",
    "registry": "docker.io",
    "softwareRequirements": {
        "@id": "https://apptainer.org/"
    },
    "softwareVersion": "library/node@sha256:dc1906714d1993d291e1e7b5f236291236b0a0b6dfacdb164e4a9ea44d09c52e",
    "tag": "slim",
    "sha256": "dc1906714d1993d291e1e7b5f236291236b0a0b6dfacdb164e4a9ea44d09c52e"
}

simleo commented 8 months ago

Version 0.3 of the profiles, which include the specs on container images, has been released, so you can change the conformsTo entries accordingly.

ro-crate-py 0.9.0 has also been released and includes https://github.com/ResearchObject/ro-crate-py/pull/162, so you can use it to add the whole workflow-run context.

jmfernandez commented 8 months ago

Nice! I'm doing it along these days!

simleo commented 8 months ago

Merging to give a home to the examples. Remaining updates can be done in a subsequent PR

ResearchObject / workflow-run-crate