pauldg closed 1 year ago
The CreateAction needs to be linked to from the root data entity via mentions:
{
"@id": "./",
"@type": "Dataset",
"mentions": {"@id": "#b91b07ec-5752-465d-a0c4-912e0312abc0"},
...
}
The CreateAction has no startTime or endTime; it should have at least an endTime. Is this info available somehow? E.g., the latest output creation date.
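The two requirements above (a link from the root data entity via mentions, and at least an endTime) could be checked automatically. The following is a minimal sketch, assuming the metadata file has been loaded into a Python dict with a flat @graph; the function name check_create_actions and the sample graph are hypothetical and not part of any RO-Crate library.

```python
def as_list(value):
    """Normalize a JSON-LD property value (absent, single or list) to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def check_create_actions(metadata):
    """Return human-readable problems found on CreateAction entities."""
    graph = metadata["@graph"]
    root = next(e for e in graph if e.get("@id") == "./")
    mentioned = {ref["@id"] for ref in as_list(root.get("mentions"))}
    problems = []
    for entity in graph:
        if "CreateAction" not in as_list(entity.get("@type")):
            continue
        if entity["@id"] not in mentioned:
            problems.append(f"{entity['@id']}: not linked from root via mentions")
        if "endTime" not in entity:
            problems.append(f"{entity['@id']}: missing endTime")
    return problems


# Hypothetical, heavily trimmed metadata used only for illustration.
example = {
    "@graph": [
        {"@id": "./", "@type": "Dataset",
         "mentions": {"@id": "#b91b07ec-5752-465d-a0c4-912e0312abc0"}},
        {"@id": "#b91b07ec-5752-465d-a0c4-912e0312abc0", "@type": "CreateAction"},
    ]
}
print(check_create_actions(example))
```

Here the action is linked correctly, so only the missing endTime is reported.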
Should also expand conformsTo to match the now released https://www.researchobject.org/workflow-run-crate/profiles/workflow_run_crate:
"@id": "./",
"@type": "Dataset",
"conformsTo": [
{"@id": "https://w3id.org/ro/wfrun/process/0.1"},
{"@id": "https://w3id.org/ro/wfrun/workflow/0.1"},
{"@id": "https://w3id.org/workflowhub/workflow-ro-crate/1.0"}
],
and their static contextual entities:
{ "@id": "https://w3id.org/ro/wfrun/process/0.1",
"@type": "CreativeWork",
"name": "Process Run Crate",
"version": "0.1"
},
{ "@id": "https://w3id.org/ro/wfrun/workflow/0.1",
"@type": "CreativeWork",
"name": "Workflow Run Crate",
"version": "0.1"
},
{ "@id": "https://w3id.org/workflowhub/workflow-ro-crate/1.0",
"@type": "CreativeWork",
"name": "Workflow RO-Crate",
"version": "1.0"
},
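Since the conformsTo list and the contextual entities must stay in sync, a generator could derive both from one table. This is only a sketch: the PROFILES constant and the profile_entities function are hypothetical names, and the values are copied from the snippets above.

```python
# (URI, name, version) for each profile, as listed in the snippets above.
PROFILES = [
    ("https://w3id.org/ro/wfrun/process/0.1", "Process Run Crate", "0.1"),
    ("https://w3id.org/ro/wfrun/workflow/0.1", "Workflow Run Crate", "0.1"),
    ("https://w3id.org/workflowhub/workflow-ro-crate/1.0", "Workflow RO-Crate", "1.0"),
]


def profile_entities(profiles=PROFILES):
    """Return (conformsTo references, contextual entities) built from one table."""
    conforms_to = [{"@id": uri} for uri, _, _ in profiles]
    entities = [
        {"@id": uri, "@type": "CreativeWork", "name": name, "version": version}
        for uri, name, version in profiles
    ]
    return conforms_to, entities


conforms_to, entities = profile_entities()
```

The first value goes into the root data entity's conformsTo, the second is appended to the @graph.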
The various _attrs.txt files seem useful for Galaxy debugging, but don't appear in the RO-Crate metadata JSON, so it's a bit cryptic what they are for or relate to.
They seem to be JSON files, but have the .txt extension, so they can use encodingFormat to explain that in the metadata. Ideally they can also link to their own conformsTo if there is some documentation about each.
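As a rough illustration of the suggestion above, a file entity for one of these files could be built as follows. This is a sketch under assumptions: the helper name describe_attrs_file, the file name jobs_attrs.txt and the documentation URL are hypothetical placeholders; only encodingFormat and conformsTo come from the comment above.

```python
def describe_attrs_file(path, profile_uri=None):
    """Build a File entity for a JSON file that carries a .txt extension."""
    entity = {
        "@id": path,
        "@type": "File",
        # The extension says text, the content is JSON: state it explicitly.
        "encodingFormat": "application/json",
    }
    if profile_uri is not None:
        # Optional link to documentation of the file's structure, if any exists.
        entity["conformsTo"] = {"@id": profile_uri}
    return entity


# Hypothetical file name and documentation URL, for illustration only.
entity = describe_attrs_file("jobs_attrs.txt", "https://example.org/galaxy/jobs-attrs")
```

The resulting entity would also need to be referenced from the root data entity's hasPart, like any other file in the crate.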
I'm not sure if there's a way to reply with the updated output comment by comment? In any case, please let me know if these changes are adequate.
The changes look good! However, there's another problem I hadn't noticed before: the representation of inputs and outputs does not match the workflow's structure. The workflow takes two collections and merges them into a single one, then concatenates the datasets from the merged collection into a single dataset and finally selects some lines from the concatenated dataset. The input parameters of the workflow should be the two input collections and the parameter that controls the number of lines for the final selection (plus the advanced parameter, which seems to control the handling of merge conflicts), while the output should be the file containing the selected lines. The current metadata file has individual files from the collections as inputs instead; also, inputs include the concatenated dataset, which is an intermediate output.
The workflow's input and output should look like this:
{
...
"input": [
{"@id": "#lineNum-param"},
{"@id": "#advanced-param"},
{"@id": "#collection1-param"},
{"@id": "#collection2-param"}
],
"output": [
{"@id": "#4a0f4078-5aff-4e02-9f9c-4ad510050e54"}
],
...
},
{
"@id": "#collection1-param",
"@type": "FormalParameter",
"additionalType": "Collection",
"name": "collection 1"
},
{
"@id": "#collection2-param",
"@type": "FormalParameter",
"additionalType": "Collection",
"name": "collection 2"
}
The action's object and result should look like this:
{
...
"object": [
{"@id": "#lineNum-pv"},
{"@id": "#advanced-pv"},
{"@id": "#dataset_collection-11"},
{"@id": "#dataset_collection-10"}
],
"result": [
{"@id": "datasets/Select_first_on_data_48_49.txt"}
],
...
}
with #dataset_collection-11 pointing to #collection1-param via exampleOfWork, and similarly for the other collection. Individual input files should not have an exampleOfWork (they participate as members of their collections). The intermediate merged collection should not be in the crate at all.
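The relationship described above (each object of the action points, via exampleOfWork, to one of the workflow's input parameters) can be verified with a few lines. This is a minimal sketch: the function name unlinked_objects and the ids workflow.gxwf.yml and #action are hypothetical, and it assumes every exampleOfWork holds a single reference rather than a list.

```python
def unlinked_objects(graph, workflow_id, action_id):
    """Return ids of action objects whose exampleOfWork is not a workflow input."""
    by_id = {entity["@id"]: entity for entity in graph}
    formal_inputs = {ref["@id"] for ref in by_id[workflow_id]["input"]}
    problems = []
    for ref in by_id[action_id]["object"]:
        entity = by_id.get(ref["@id"], {})
        target = entity.get("exampleOfWork", {}).get("@id")
        if target not in formal_inputs:
            problems.append(ref["@id"])
    return problems


# Trimmed, partly hypothetical graph: the second collection lacks its link on purpose.
graph = [
    {"@id": "workflow.gxwf.yml",
     "input": [{"@id": "#collection1-param"}, {"@id": "#collection2-param"}]},
    {"@id": "#action",
     "object": [{"@id": "#dataset_collection-11"}, {"@id": "#dataset_collection-10"}]},
    {"@id": "#dataset_collection-11", "@type": "Collection",
     "exampleOfWork": {"@id": "#collection1-param"}},
    {"@id": "#dataset_collection-10", "@type": "Collection"},
]
print(unlinked_objects(graph, "workflow.gxwf.yml", "#action"))
```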
That was the main thing. I've also found two minor issues:
- [...] additionalType. That's not understood in the RO-Crate context. If there is a URI that leads to a description of the list type in Galaxy, that would be a good value. If not, better remove the additionalType entry.
- [...] subjectOf
I have pushed here the full expected metadata file with all the changes.
Thanks Simone, that clears up some of the details of the format. I'll continue with it.
I've made further updates to the code addressing the required changes. Unfortunately the diff with the previous version is a bit difficult to make since I moved around some parts.
A few things to note:
- .gxwf.yml is the new standard representation for Galaxy workflows, so I've made that the main entity, and the CWL representation is connected to it using subjectOf.
- The advanced parameter, which controls the handling of merge conflicts, is a tool parameter rather than a workflow parameter, which means that I'm not able to provide a different value for it when (re-)running the workflow. It became "hardcoded" in the workflow definition when I created the workflow. The only way to change its value would be to change the workflow definition. On the other hand, I have enabled num_lines_param as a workflow parameter, so I can provide a different value for it every time I rerun the workflow (the value is then provided to the tool at runtime). The question is thus whether the advanced parameter should be included in the RO-Crate at all.
- The intermediary collection and the concatenated collection are both defined in the workflow as outputs, and thus they are listed as outputs.
OK. If they are workflow outputs for Galaxy, they have to be listed as workflow outputs in the RO-Crate as well.
Regarding the "advanced" parameter, if it's not enabled as a workflow parameter then it should not be included in the RO-Crate. However, I have some comments regarding its representation, which would become relevant in those cases where such parameters would have to be included. In the current version of the example, the PropertyValue is:
{
"@id": "#advanced-pv",
"@type": "PropertyValue",
"exampleOfWork": {"@id": "#advanced-param"},
"name": "merge collections tool PropertyValue",
"value": {
"conflict": {
"__current_case__": 0,
"duplicate_options": "suffix_conflict",
"suffix_pattern": "_#"
}
}
}
I.e., the value has been inserted as JSON and merged with the overall JSON structure, making the RO-Crate invalid. In the previous version, instead, the value was inserted as a string, which is OK:
"value": "{\"conflict\": {\"__current_case__\": 0, \"duplicate_options\": \"suffix_conflict\", \"suffix_pattern\": \"_#\"}}"
Also, I think that "advanced" refers to the whole set of "hidden" parameters in the Galaxy interface, and there could be more than one. So the parameter should actually be called "conflict", leading to something like:
{
"@id": "#conflict-pv",
"@type": "PropertyValue",
"exampleOfWork": {"@id": "#conflict-param"},
"name": "conflict",
"value": "{\"__current_case__\": 0, \"duplicate_options\": \"suffix_conflict\", \"suffix_pattern\": \"_#\"}"
},
{
"@id": "#conflict-param",
"@type": "FormalParameter",
"additionalType": "Text",
"name": "conflict",
"valueRequired": "False"
},
But, again, in this specific case the parameter should not be included.
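For cases where such a parameter does have to be included, the string-valued form shown above can be produced by serializing the nested structure before assigning it to value. A minimal sketch using only the standard library; the ids and names are the ones from the snippet above.

```python
import json

# Nested tool state as Galaxy reports it (copied from the example above).
conflict = {
    "__current_case__": 0,
    "duplicate_options": "suffix_conflict",
    "suffix_pattern": "_#",
}

property_value = {
    "@id": "#conflict-pv",
    "@type": "PropertyValue",
    "exampleOfWork": {"@id": "#conflict-param"},
    "name": "conflict",
    # Serialize to a string so the nested JSON is not merged into the JSON-LD graph.
    "value": json.dumps(conflict),
}

# A consumer can recover the original structure with json.loads.
assert json.loads(property_value["value"]) == conflict
```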
Here's a list of issues I've found in the current version of the example:
- exampleOfWork links are broken because they are missing the leading hash mark. For instance, dataset_collection-10-param should be #dataset_collection-10-param.
- [...] #num_lines_param-param, but the entity is not in the crate
- The additionalType for formal parameters corresponding to collections should be Collection
- There's a duplicate reference to datasets/hello_33.txt in #dataset_collection-13
I've pushed the expected metadata file according to the above changes here.
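Broken links like the ones listed above (a missing leading hash mark, or a reference to an entity that is not in the crate) can be caught by checking every reference against the ids present in the @graph. This is a rough sketch with hypothetical function names; it assumes that any @id containing "://" is an external URL that may legitimately be absent from the graph.

```python
def iter_refs(value):
    """Yield every {"@id": ...} reference nested in a property value."""
    if isinstance(value, dict):
        if set(value) == {"@id"}:
            yield value["@id"]
        else:
            for nested in value.values():
                yield from iter_refs(nested)
    elif isinstance(value, list):
        for item in value:
            yield from iter_refs(item)


def dangling_refs(graph):
    """Return sorted (entity id, property, target) triples whose target is missing."""
    known = {entity["@id"] for entity in graph}
    missing = set()
    for entity in graph:
        for key, value in entity.items():
            if key == "@id":
                continue
            for ref in iter_refs(value):
                # Assumption: absolute URLs may point outside the crate.
                if "://" not in ref and ref not in known:
                    missing.add((entity["@id"], key, ref))
    return sorted(missing)


# Trimmed sample reproducing two of the problems; "#pv" is a hypothetical id.
graph = [
    {"@id": "#dataset_collection-10",
     "exampleOfWork": {"@id": "dataset_collection-10-param"}},
    {"@id": "#dataset_collection-10-param", "@type": "FormalParameter"},
    {"@id": "#pv", "exampleOfWork": {"@id": "#num_lines_param-param"}},
]
for problem in dangling_refs(graph):
    print(problem)
```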
I agree with all changes, the only one I have doubts about is this:
- There's a duplicate reference to datasets/hello_33.txt in #dataset_collection-13
This is intended since the two input collections reference the same input dataset and the input datasets use the filename as the id:
{
"@id": "#dataset_collection-11",
"@type": "Collection",
"hasPart": [
{
"@id": "datasets/hello_33.txt"
},
{
"@id": "datasets/world_34.txt"
}
]
},
and
{
"@id": "#dataset_collection-10",
"@type": "Collection",
"hasPart": [
{
"@id": "datasets/hello_33.txt"
},
{
"@id": "datasets/universe_31.txt"
}
]
},
Yes, but when you merge the collections only one of the datasets with a repeated name is included in the merged collection. This is explained here (in the "Merge collections" subsection). From the RO-Crate metadata file's standpoint it's the same: duplicate entries do not make sense. Although multiple values are represented as JSON lists, their JSON-LD semantics is basically that of sets.
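To illustrate the set-like semantics, the hasPart of a merged collection can be built by dropping repeated references while keeping the first occurrence of each. This is only a sketch: dedupe_refs is a hypothetical helper, and the order in which the two collections are concatenated here is an assumption made for the example.

```python
def dedupe_refs(refs):
    """Drop repeated {"@id": ...} references, keeping the first occurrence."""
    seen = set()
    unique = []
    for ref in refs:
        if ref["@id"] not in seen:
            seen.add(ref["@id"])
            unique.append(ref)
    return unique


# hasPart values of the two input collections shown above.
collection_11 = [{"@id": "datasets/hello_33.txt"}, {"@id": "datasets/world_34.txt"}]
collection_10 = [{"@id": "datasets/hello_33.txt"}, {"@id": "datasets/universe_31.txt"}]

merged_has_part = dedupe_refs(collection_11 + collection_10)
```

With these inputs, merged_has_part holds three references, with datasets/hello_33.txt appearing once.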
Updated the code base to address the required changes.
About the merged collections, in the example workflow the collections are merged using the advanced parameter that handles conflicts (see the screenshot below). So there are two references to one dataset, but the two elements of the collection do receive a unique "element identifier". Should this be expressed somehow in the RO-Crate metadata?
Since the selected conflict handler appends suffixes to conflicted element identifiers, the same can be done in the RO-Crate: the generator can add two copies of datasets/hello_33.txt to the crate, named datasets/hello_33_1.txt and datasets/hello_33_2.txt, then the hasPart of the merged collection can be:
"hasPart": [
{
"@id": "datasets/hello_33_1.txt"
},
{
"@id": "datasets/world_34.txt"
},
{
"@id": "datasets/hello_33_2.txt"
},
{
"@id": "datasets/universe_31.txt"
}
]
An alternative is to slightly change the example workflow to use the default conflict resolution, which keeps only one copy.
I've changed the conflict resolution parameter for merging collections to the default, keep first. Some of the elements in ro-crate metadata have been reordered and renamed.
Looks good! Merging. For the Zenodo upload, please use a zip file: if you do that, Zenodo recognizes the format and generates a summary of contained files to display in the record's page. If you can, use the .crate.zip extension, so it's compatible with WorkflowHub. Note that the zip needs to contain directly the contents of the RO-Crate, so that ro-crate-metadata.json is at the top level.
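One way to produce such an archive is sketched below with Python's standard zipfile module. The function name zip_crate is hypothetical; the sketch assumes the zip is written outside the crate directory, so that the archive does not end up including itself.

```python
import zipfile
from pathlib import Path


def zip_crate(crate_dir, zip_path):
    """Zip the contents of crate_dir so that its files sit at the archive's top level."""
    crate_dir = Path(crate_dir)
    if not (crate_dir / "ro-crate-metadata.json").is_file():
        raise FileNotFoundError("ro-crate-metadata.json not found in the crate root")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(crate_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the crate root, not to its parent directory.
                zf.write(path, path.relative_to(crate_dir).as_posix())
    return zip_path
```

For example, zip_crate("my_crate", "my_crate.crate.zip") (both names are placeholders) would store ro-crate-metadata.json directly at the top level of the archive rather than under my_crate/.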
Example Galaxy Workflowrun ro-crate which includes Galaxy features: collections and parameters