Closed jmfernandez closed 8 months ago
The workflow run crates will have to conform to the 0.2 version of the profile, to be released soon; see https://github.com/ResearchObject/workflow-run-crate/pull/55.
I have been updating previous examples, as well as adding new, real life, ones. New examples in this repo only include the generated ro-crate-metadata.json and either a copy of the workflow or its packed version, as the used inputs and containers need several GBs.
Thanks for the updates, José María. Looking at the previous examples, progress has been made. However, some issues remain, and the new workflows bring some other issues. I'll list what I've found below.
Crates need to specify a license. We've also been discussing this aspect with the Sapporo team here.
Some lists still have duplicates. For instance, the workflow's `input` and `output` in cosifer-nxf_provenance.
`softwareRequirements` is a property of `SoftwareApplication`, so `SoftwareApplication` should be added to the workflow's type list when `softwareRequirements` is used on a workflow.
The `additionalType` for entities representing CWL string parameters (e.g., for `separator`) needs to be `Text`, not `String` (`String` is not a term in the RO-Crate context).
Consolidation actions have WfExS-backend as their agent. The `agent` of an action should be a `Person` or `Organization`, not a `SoftwareApplication`. I guess that here there is an attempt to represent the fact that cwltool was launched by WfExS-backend, but elsewhere in the crate this kind of relationship is modeled with `softwareRequirements`. So I would remove the `agent` entry, change the `instrument` of the action to WfExS-backend and list cwltool as a software requirement (this was discussed at the meeting yesterday).
`docker://` URIs are used as contextual entity `@id`s in several places. I don't think this is forbidden by the RO-Crate spec, but the examples you find there are usually http(s) and mailto. If less common schemes are used, I think the crate should include a README to help the user understand what they are about and how to interact with them.
The `sha256` term is not in the `https://w3id.org/ro/crate/1.1/context`, but we have it in the workflow-run ro-terms context. To bring in the definition, change the `@context` entry as follows:
"@context": [
"https://w3id.org/ro/crate/1.1/context",
"https://w3id.org/ro/terms/workflow-run"
]
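Several of the points above (the `@context` entry, duplicated `input`/`output` references, the missing `SoftwareApplication` type and the `String` vs `Text` term) are mechanical enough to be fixed by a script. The following is only a minimal sketch working on the plain JSON of `ro-crate-metadata.json`; the function name `normalize` and the example `@id`s are hypothetical, and it is not how WfExS-backend generates its crates.

```python
import json

WFRUN_CONTEXT = "https://w3id.org/ro/terms/workflow-run"


def as_list(value):
    """Return a JSON-LD property value as a list (missing -> empty list)."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def normalize(metadata):
    """Apply the mechanical fixes in place and return the metadata dict."""
    # 1. Make sure the workflow-run terms (which define sha256) are in @context.
    context = as_list(metadata.get("@context"))
    if WFRUN_CONTEXT not in context:
        context.append(WFRUN_CONTEXT)
    metadata["@context"] = context

    for entity in metadata.get("@graph", []):
        types = as_list(entity.get("@type"))
        # 2. Drop duplicate references from input/output, keeping first occurrences.
        for prop in ("input", "output"):
            if isinstance(entity.get(prop), list):
                seen, unique = set(), []
                for ref in entity[prop]:
                    key = json.dumps(ref, sort_keys=True)
                    if key not in seen:
                        seen.add(key)
                        unique.append(ref)
                entity[prop] = unique
        # 3. A workflow using softwareRequirements must also be a SoftwareApplication.
        if (
            "ComputationalWorkflow" in types
            and "softwareRequirements" in entity
            and "SoftwareApplication" not in types
        ):
            entity["@type"] = types + ["SoftwareApplication"]
        # 4. "String" is not a term in the RO-Crate context; "Text" is.
        if entity.get("additionalType") == "String":
            entity["additionalType"] = "Text"
    return metadata


# Hypothetical, heavily trimmed metadata used only to exercise the function.
example = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "workflow/main.nf",
            "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
            "input": [{"@id": "#param:a"}, {"@id": "#param:a"}],
            "softwareRequirements": {"@id": "#engine"},
        },
        {"@id": "#param:separator", "@type": "FormalParameter", "additionalType": "String"},
    ],
}
normalize(example)
```

The function only touches the four cases listed; anything that needs a modeling decision (licenses, agents, README content) is left alone.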
cosifer-cwl_staged
Since the only action is the consolidation, which is not a workflow execution, this crate should not list Workflow Run Crate (`https://w3id.org/ro/wfrun/workflow/0.2`) under `conformsTo`.
The root dataset should link to the `CreateAction` instance via `mentions`.
The crate contains "dangling" `PropertyValue` entities, i.e., not listed in any action's `object`. The fact that they don't appear in the action's `object` is normal since what's described is not the workflow run but its consolidation. Those `PropertyValue` entries should be removed. The same goes for the "inputs/data_matrix.csv" `File`: it should be removed since it's not involved in the described process. Is there any other reason to include them? Does WfExS use such "staged" RO-Crates to run the workflow, reading parameter settings from them? If so, this should be explained in the README and in the paper.
cosifer-nxf_provenance
`nextflow.config` is in `hasPart`, but `hasPart` should only list the tools orchestrated by the workflow. `nextflow.config` should be listed in the action's `object` (see referencing configuration files). `hasPart` in the workflow can be omitted altogether.
cosifer-nxf_staged
This crate does not have a `CreateAction`, so it should not `conformsTo` `https://w3id.org/ro/wfrun/process/0.2` nor to `https://w3id.org/ro/wfrun/workflow/0.2`. It's a plain Workflow RO-Crate.
Same consideration as cosifer-cwl_staged for dangling entities.
wombat-pipelines_provenance
Some of the entities listed in the workflow's `hasPart` seem to be configuration files rather than tools, see note above for cosifer-nxf_provenance.
Some of the action's `object` entries point to `Intangible` entities, which are not part of the model. As discussed at the meeting yesterday, this was an attempt to represent a null value. We agreed to add `"valueRequired": "False"` in the `FormalParameter` and omit the value from the `object` listing instead. Ideally, however, we should be able to explicitly state that a `defaultValue` is null in the `FormalParameter` and to express that a value was set to null in a `PropertyValue` for languages that allow this, but how? @stain can JavaScript `null` be used directly in RO-Crate (e.g., `"value": null`)?
There are some type mismatches between `FormalParameter`s and corresponding `PropertyValue`s. For instance, `workflow/main.nf#param:run_statistics` has an `additionalType` of `Integer` but the corresponding `PropertyValue` has a boolean value.
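Mismatches like the one above can be found by following each `PropertyValue`'s `exampleOfWork` link to its `FormalParameter` and comparing the declared `additionalType` with the actual value. The sketch below is an illustration only: it assumes values are stored as native JSON types (if a crate serializes them as strings, the checks would need adapting), it covers just four types, and the `#pv-run_statistics` identifier is invented.

```python
def as_list(value):
    """Return a JSON-LD property value as a list (missing -> empty list)."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


# Assumption: values are native JSON types, not strings such as "True".
CHECKS = {
    "Integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "Float": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "Boolean": lambda v: isinstance(v, bool),
    "Text": lambda v: isinstance(v, str),
}


def type_mismatches(graph):
    """Return (value id, parameter id, declared type) for each mismatch."""
    by_id = {entity["@id"]: entity for entity in graph}
    problems = []
    for entity in graph:
        if "PropertyValue" not in as_list(entity.get("@type")) or "value" not in entity:
            continue
        for ref in as_list(entity.get("exampleOfWork")):
            if not isinstance(ref, dict):
                continue
            param = by_id.get(ref.get("@id"), {})
            declared = param.get("additionalType")
            check = CHECKS.get(declared) if isinstance(declared, str) else None
            if check is not None and not check(entity["value"]):
                problems.append((entity["@id"], param["@id"], declared))
    return problems


graph = [
    {
        "@id": "workflow/main.nf#param:run_statistics",
        "@type": "FormalParameter",
        "additionalType": "Integer",
    },
    {
        "@id": "#pv-run_statistics",  # hypothetical identifier
        "@type": "PropertyValue",
        "exampleOfWork": {"@id": "workflow/main.nf#param:run_statistics"},
        "value": True,
    },
]
print(type_mismatches(graph))
```

Booleans are excluded from the `Integer` and `Float` checks explicitly because `bool` is a subclass of `int` in Python.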
nfcore-rnaseq_provenance
Same issues as wombat-pipelines_provenance, plus:
This crate uses `git+https://` and `s3://` schemes for data entities. What's the consumer supposed to do with them? Considerations are similar to those made for `docker://` above, but these are data entities, so the consumer should know how to retrieve them.
There's a `Collection` that has an `Intangible` as its `mainEntity`. The `mainEntity` should be an entry point to a composite dataset (e.g. an index file), as explained in Representing multi-file objects. If such a "special" file exists in the collection it should be listed both as the `mainEntity` and under `hasPart`.
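To get an overview of which URI schemes a crate relies on (such as the `git+https://` and `s3://` ones mentioned above), the `@id`s can be tallied by scheme with the standard library. This is a small sketch; the example identifiers are invented, and note that a relative `@id` that happens to look like `word:rest` would be counted as having a scheme.

```python
from collections import Counter
from urllib.parse import urlsplit


def id_schemes(graph):
    """Count the URI schemes used in entity @ids (relative ids are skipped)."""
    counts = Counter()
    for entity in graph:
        scheme = urlsplit(entity.get("@id", "")).scheme
        if scheme:
            counts[scheme] += 1
    return counts


# Hypothetical identifiers, for illustration only.
graph = [
    {"@id": "./"},
    {"@id": "git+https://example.org/some/repo.git"},
    {"@id": "s3://example-bucket/data/sample.fastq.gz"},
    {"@id": "https://example.org/reference.fa"},
    {"@id": "workflow/main.nf#param:run_statistics"},
]
print(dict(id_schemes(graph)))
```

Any scheme other than `http`, `https` or `mailto` in the result is a candidate for an explanation in the crate's README.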
Wetlab2Variations_CWL_provenance
This crate uses `trs://` schemes for data entities. Same consideration as those made for `git+https://` etc.
The `#e0b54cc1-05bb-46cd-9f62-b44293f4c053` collection has a `mainEntity` that's not listed in the collection's `hasPart`.
I have revised the main issues. I still have to revise the other ones, and figure out the best way to add and semantically relate a README file describing the meaning and usage of the different schemes to the generated RO-Crates
The biggest issue with the recent changes is that `CreateAction` instances have `instrument` properties with multiple values. `CreateAction` instances need to have a single `instrument`, pointing to the application used to perform the action. In particular, the action that represents the workflow run must have the workflow itself as its `instrument`. This is crucial to reconstruct the mapping between the action's actual values and the formal parameters. If you wish to list dependencies somehow, it's better to include them in the application's `softwareRequirements` as you have been doing all along. Multiple values for `instrument` break `runcrate report`, BTW.
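The single-`instrument` rule is easy to check before publishing a crate. A minimal sketch follows, assuming the `@graph` list has been loaded from `ro-crate-metadata.json`; the function name and the `@id`s in the example are illustrative.

```python
def as_list(value):
    """Return a JSON-LD property value as a list (missing -> empty list)."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def bad_instrument_actions(graph):
    """Return @ids of CreateAction entities without exactly one instrument."""
    return [
        entity["@id"]
        for entity in graph
        if "CreateAction" in as_list(entity.get("@type"))
        and len(as_list(entity.get("instrument"))) != 1
    ]


# Hypothetical actions: only the first follows the rule.
graph = [
    {"@id": "#run-ok", "@type": "CreateAction", "instrument": {"@id": "workflow/main.nf"}},
    {
        "@id": "#run-multi",
        "@type": "CreateAction",
        "instrument": [{"@id": "workflow/main.nf"}, {"@id": "#container-engine"}],
    },
]
print(bad_instrument_actions(graph))
```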
Other issues I've found with the latest changes:
The root dataset should not have an `"about": {"@id": "README.md"}`; it's the readme that is about the crate, not the other way around, so the "README.md" entity should have an `"about": {"@id": "./"}`.
The entities used to represent Docker images link to a JSON file via `hasPart`. I think this should be `subjectOf` instead, though it's redundant since the entities that represent the JSON files have an `about` pointing to the Docker image entities. By the way, what are these JSON files? Do they conform to some standard for describing images / containers?
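The direction of the `about` relation between the root dataset and the README can be corrected with a few lines. This sketch assumes the root dataset has `@id` `./` and the readme entity has `@id` `README.md`, as in the remark above; the function name is hypothetical.

```python
def fix_readme_about(graph, readme_id="README.md", root_id="./"):
    """Make the README `about` the root dataset, not the other way around."""
    by_id = {entity["@id"]: entity for entity in graph}
    root, readme = by_id.get(root_id), by_id.get(readme_id)
    if root is None or readme is None:
        return graph  # nothing to fix
    # Remove the reversed link only if it points at the README.
    if root.get("about") == {"@id": readme_id}:
        del root["about"]
    readme["about"] = {"@id": root_id}
    return graph


graph = [
    {"@id": "./", "@type": "Dataset", "about": {"@id": "README.md"}},
    {"@id": "README.md", "@type": "File"},
]
fix_readme_about(graph)
```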
> The biggest issue with the recent changes is that `CreateAction` instances have `instrument` properties with multiple values. `CreateAction` instances need to have a single `instrument`, pointing to the application used to perform the action. In particular, the action that represents the workflow run must have the workflow itself as its `instrument`. This is crucial to reconstruct the mapping between the action's actual values and the formal parameters. If you wish to list dependencies somehow, it's better to include them in the application's `softwareRequirements` as you have been doing all along. Multiple values for `instrument` break `runcrate report`, BTW.
I understand, and I have applied the change. The reason I added multiple instruments to the `CreateAction` is that I realized a workflow can be run in different containerization modes, so fully describing the workflow execution (not the workflow itself) requires one of the following: adding all the software requirements under `instrument`; declaring all the instruments together in a collection and adding the collection under `instrument`; or declaring them in a different place (for example under `softwareAddOn`, although that is not completely correct). For instance, an nf-core workflow can be run either in conda mode (so the dependencies should be linked somewhere) or in singularity mode (so the containers used should be pointed out), plus other tangential software execution details, such as a queue system.
So, my conclusion is that the containers and the container engine should not appear under `softwareRequirements` of the computational workflow as such, since they might not appear in a different execution. So, in the meantime, I'm declaring them under `softwareAddOn`.
In the long term, I think we should distinguish between the workflow as `SoftwareSourceCode` (the `targetPlatform` is the workflow engine) and the instantiated workflow as `SoftwareApplication` (the `softwareRequirements` are the workflow engine, the containers, etc.). What do you think?
> Other issues I've found with the latest changes:
>
> * The root dataset should not have an `"about": {"@id": "README.md"}`; it's the readme that is about the crate, not the other way around, so the "README.md" entity should have an `"about": {"@id": "./"}`.
Thanks! I didn't realize I put the relation in the wrong way.
> * The entities used to represent Docker images link to a JSON file via `hasPart`. I think this should be `subjectOf` instead, though it's redundant since the entities that represent the JSON files have an `about` pointing to the Docker image entities. By the way, what are these JSON files? Do they conform to some standard for describing images / containers?
These JSON files hold metadata gathered by WfExS-backend, and their format depends on whether Singularity/Apptainer or Docker/Podman was used. They help WfExS-backend to identify when the original container and the contents from the cache (or RO-Crate) do not match. I have just added a couple of paragraphs to the automatically included README.md files, giving some details.
Following up on your last questions: when docker or podman mode has been used to run the workflow, the metadata of the images can be obtained using `docker inspect imagetag` or `podman inspect imagetag`. This metadata is an array describing each layer of the image, and is preserved under the `manifests` key. The other keys keep digests and an id that is stable across `docker save` + `docker load` operations.
When Singularity/Apptainer mode is used for the workflow execution, container images can come from either HTTP requests or Docker registries. Either kind is materialized by `singularity pull`, but that command does not preserve the metadata from Docker registries, such as the original id and original layers of the container. So, for Singularity images created from Docker ones, the code queries the registry through its REST API to get the original container image metadata.
So, although part of the content of those JSON files is original metadata, they are augmented with additional details.
I think distinguishing between "just code" and actual running application would be very hard at this point, since the whole model (and the tooling) is based on code/application being on the prospective part and actions on the retrospective part. Even at the lowest level of Process Run Crate it says that the application's type should include `SoftwareApplication`, `SoftwareSourceCode` or `ComputationalWorkflow`. The main problem with `softwareAddOn` is that it's still on the prospective side (in `SoftwareApplication`). We need a way to associate the container image with the action. Can you try using the current ContainerImage proposal? For instance, for cosifer-cwl_provenance it should look like:
{
"@id": "#e0d55b35-b042-420e-8cf3-c8424644f17b",
"@type": "CreateAction",
"containerImage": "#cosifer-image"
},
{
"@id": "#cosifer-image",
"@type": "ContainerImage",
"additionalType": "DockerImage",
"registry": "docker.io",
"name": "tsenit/cosifer",
"tag": "b4d5af45d2fc54b6bff2a9153a8e9054e560302e"
}
We could add `URL` to the range of `containerImage`, so you can also do:
{
"@id": "#e0d55b35-b042-420e-8cf3-c8424644f17b",
"@type": "CreateAction",
"containerImage": "docker://tsenit/cosifer:b4d5af45d2fc54b6bff2a9153a8e9054e560302e"
}
Regarding the JSON files about the container images, the container images should not refer to them via `hasPart`.
I have updated all the examples, so they are now using `ContainerImage`. Also, they should now reflect the cases where the original source of the container is a Docker registry, but the tool used to materialize it was singularity/apptainer.
Looking good for the most part. In Wetlab2Variations_CWL_provenance, `ContainerImage` entities have `@id`s that are neither URIs nor strings starting with `#`. This should not happen since these are contextual entities. For some reason it seems that the naming scheme used in other crates is not applied here.
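The `@id` rule for `ContainerImage` entities can be approximated with a simple heuristic: accept identifiers that start with `#` or that parse as having a URI scheme. This is only a sketch; the example identifiers are invented, and a bare `name:tag` string without slashes would be mistaken for a URI by this check.

```python
from urllib.parse import urlsplit


def as_list(value):
    """Return a JSON-LD property value as a list (missing -> empty list)."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def suspicious_container_ids(graph):
    """Return ContainerImage @ids that are neither '#...' nor scheme-qualified."""
    suspicious = []
    for entity in graph:
        if "ContainerImage" not in as_list(entity.get("@type")):
            continue
        entity_id = entity.get("@id", "")
        if not entity_id.startswith("#") and not urlsplit(entity_id).scheme:
            suspicious.append(entity_id)
    return suspicious


# Hypothetical identifiers: only the last one should be reported.
graph = [
    {"@id": "#cosifer-image", "@type": "ContainerImage"},
    {"@id": "docker://docker.io/node:slim", "@type": ["ContainerImage", "SoftwareApplication"]},
    {"@id": "registry.example.org/tools/aligner:1.0", "@type": "ContainerImage"},
]
print(suspicious_container_ids(graph))
```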
Other remarks (for all crates):
`softwareAddOn` and put everything under `softwareRequirements`
`sha256` property. E.g.:
{
"@id": "docker://docker.io/node:slim",
"@type": [
"ContainerImage",
"SoftwareApplication"
],
"additionalType": "DockerImage",
"applicationCategory": "https://www.wikidata.org/wiki/Q51294208",
"name": "docker.io/node",
"operatingSystem": "linux",
"processorRequirements": "amd64",
"registry": "docker.io",
"softwareRequirements": {
"@id": "https://apptainer.org/"
},
"softwareVersion": "library/node@sha256:dc1906714d1993d291e1e7b5f236291236b0a0b6dfacdb164e4a9ea44d09c52e",
"tag": "slim",
"sha256": "dc1906714d1993d291e1e7b5f236291236b0a0b6dfacdb164e4a9ea44d09c52e"
}
Version 0.3 of the profiles, which includes the specs on container images, has been released, so you can change the `conformsTo` entries accordingly.
ro-crate-py 0.9.0 has also been released and includes https://github.com/ResearchObject/ro-crate-py/pull/162, so you can use it to add the whole `workflow-run` context.
Nice! I'll be doing it over the next few days!
Merging to give a home to the examples. Remaining updates can be done in a subsequent PR
I have just added the generated RO-Crates from the execution of two toy workflows with WfExS-backend.