ResearchObject / workflow-run-crate

Workflow Run RO-Crate profile
https://www.researchobject.org/workflow-run-crate/
Apache License 2.0

Example Galaxy Workflowrun ro-crate #44

Closed pauldg closed 1 year ago

pauldg commented 1 year ago

Example Galaxy Workflowrun ro-crate which includes Galaxy features: collections and parameters

simleo commented 1 year ago

The CreateAction needs to be linked to from the root data entity via mentions:

{
    "@id": "./",
    "@type": "Dataset",
    "mentions": {"@id": "#b91b07ec-5752-465d-a0c4-912e0312abc0"},
    ...
}
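
A minimal sketch of how a generator could add that link, assuming the metadata is already loaded as a Python dict with a `@graph` list and that the root data entity has the `@id` `./`. The helper name `link_action_from_root` is hypothetical:

```python
def link_action_from_root(metadata, action_id):
    """Reference action_id from the root data entity via "mentions"."""
    # Assumption: the root data entity is the one whose @id is "./".
    root = next(e for e in metadata["@graph"] if e.get("@id") == "./")
    mentions = root.get("mentions", [])
    if isinstance(mentions, dict):
        # A single reference may be stored as a bare object; normalize to a list.
        mentions = [mentions]
    if not any(m.get("@id") == action_id for m in mentions):
        mentions.append({"@id": action_id})
    root["mentions"] = mentions
    return metadata


metadata = {"@graph": [
    {"@id": "./", "@type": "Dataset"},
    {"@id": "#b91b07ec-5752-465d-a0c4-912e0312abc0", "@type": "CreateAction"},
]}
link_action_from_root(metadata, "#b91b07ec-5752-465d-a0c4-912e0312abc0")
```

Calling the helper twice leaves a single reference, so it is safe to run on a crate that already has the link.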

simleo commented 1 year ago

The CreateAction has no startTime or endTime; it should have at least an endTime. Is this info available somehow? E.g., latest output creation date.
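
One way the "latest output creation date" idea could be sketched, assuming the generator can obtain a creation timestamp for each output as an ISO 8601 string. The helper name `latest_end_time` and the timestamps are made up for illustration and are not taken from the example crate:

```python
from datetime import datetime, timezone


def latest_end_time(create_times):
    """Return the most recent timestamp, normalized to UTC, as ISO 8601."""
    latest = max(datetime.fromisoformat(t) for t in create_times)
    return latest.astimezone(timezone.utc).isoformat()


# Illustrative values only (assumed to carry a UTC offset).
output_times = ["2023-05-02T09:15:00+00:00", "2023-05-02T09:17:30+00:00"]

action = {"@id": "#b91b07ec-5752-465d-a0c4-912e0312abc0", "@type": "CreateAction"}
action["endTime"] = latest_end_time(output_times)
```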

stain commented 1 year ago

The root data entity should also expand conformsTo to match the now-released profiles (https://www.researchobject.org/workflow-run-crate/profiles/workflow_run_crate):

        "@id": "./",
        "@type": "Dataset",
        "conformsTo": [
            {"@id": "https://w3id.org/ro/wfrun/process/0.1"},
            {"@id": "https://w3id.org/ro/wfrun/workflow/0.1"},
            {"@id": "https://w3id.org/workflowhub/workflow-ro-crate/1.0"}
        ],

and their static contextual entities:

    {   "@id": "https://w3id.org/ro/wfrun/process/0.1",
        "@type": "CreativeWork",
        "name": "Process Run Crate",
        "version": "0.1"
    },
    {   "@id": "https://w3id.org/ro/wfrun/workflow/0.1",
        "@type": "CreativeWork",
        "name": "Workflow Run Crate",
        "version": "0.1"
    },
    {   "@id": "https://w3id.org/workflowhub/workflow-ro-crate/1.0",
        "@type": "CreativeWork",
        "name": "Workflow RO-Crate",
        "version": "1.0"
    },
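
A rough sketch of how a generator might add both pieces in one pass, assuming the metadata is a Python dict with a `@graph` list and the root data entity has the `@id` `./`. The helper name `add_profiles` is hypothetical; the URLs, names and versions are the ones listed above:

```python
PROFILES = [
    ("https://w3id.org/ro/wfrun/process/0.1", "Process Run Crate", "0.1"),
    ("https://w3id.org/ro/wfrun/workflow/0.1", "Workflow Run Crate", "0.1"),
    ("https://w3id.org/workflowhub/workflow-ro-crate/1.0", "Workflow RO-Crate", "1.0"),
]


def add_profiles(metadata):
    """Set conformsTo on the root entity and add the profile entities."""
    graph = metadata["@graph"]
    root = next(e for e in graph if e.get("@id") == "./")
    root["conformsTo"] = [{"@id": url} for url, _, _ in PROFILES]
    known = {e.get("@id") for e in graph}
    for url, name, version in PROFILES:
        if url not in known:  # do not add a contextual entity twice
            graph.append({
                "@id": url,
                "@type": "CreativeWork",
                "name": name,
                "version": version,
            })
    return metadata


crate = add_profiles({"@graph": [{"@id": "./", "@type": "Dataset"}]})
```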

stain commented 1 year ago

The various _attrs.txt files seem useful for Galaxy debugging, but they don't appear in the RO-Crate metadata JSON, so it's a bit cryptic what they are for or what they relate to.

They seem to be JSON files, but have the .txt extension - so they can use encodingFormat to explain that in the metadata. Ideally they can also link to their own conformsTo if there is some documentation about each.
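
As a sketch of what that could look like in a generator, assuming the graph is a Python list of entity dicts and the root entity's `hasPart` is a list. The helper name `describe_attrs_file` and the file name used in the call are illustrative; `profile_url` is only set when such documentation actually exists:

```python
def describe_attrs_file(graph, path, profile_url=None):
    """Add a File entity for a JSON file that carries a .txt extension."""
    entity = {
        "@id": path,
        "@type": "File",
        # The content is JSON even though the extension says otherwise.
        "encodingFormat": "application/json",
    }
    if profile_url:
        entity["conformsTo"] = {"@id": profile_url}
    graph.append(entity)
    # Assumption: the root data entity has @id "./" and a list-valued hasPart.
    root = next(e for e in graph if e.get("@id") == "./")
    root.setdefault("hasPart", []).append({"@id": path})
    return entity


graph = [{"@id": "./", "@type": "Dataset", "hasPart": []}]
describe_attrs_file(graph, "datasets_attrs.txt")  # illustrative file name
```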

pauldg commented 1 year ago

I'm not sure if there's a way to reply with the updated output comment by comment? In any case, please let me know if these changes are adequate.

simleo commented 1 year ago

please let me know if these changes are adequate

The changes look good! However, there's another problem I hadn't noticed before: the representation of inputs and outputs does not match the workflow's structure. The workflow takes two collections and merges them into a single one, then concatenates the datasets from the merged collection into a single dataset and finally selects some lines from the concatenated dataset. The input parameters of the workflow should be the two input collections and the parameter that controls the number of lines for the final selection (plus the advanced parameter, which seems to control the handling of merge conflicts), while the output should be the file containing the selected lines. The current metadata file has individual files from the collections as inputs instead; also, inputs include the concatenated dataset, which is an intermediate output.

The workflow's input and output should look like this:

{
    ...
    "input": [
        {"@id": "#lineNum-param"},
        {"@id": "#advanced-param"},
        {"@id": "#collection1-param"},
        {"@id": "#collection2-param"}
    ],
    "output": [
        {"@id": "#4a0f4078-5aff-4e02-9f9c-4ad510050e54"}
    ],
    ...
},
{
    "@id": "#collection1-param",
    "@type": "FormalParameter",
    "additionalType": "Collection",
    "name": "collection 1"
},
{
    "@id": "#collection2-param",
    "@type": "FormalParameter",
    "additionalType": "Collection",
    "name": "collection 2"
}

The action's object and result should look like this:

{
    ...
    "object": [
        {"@id": "#lineNum-pv"},
        {"@id": "#advanced-pv"},
        {"@id": "#dataset_collection-11"},
        {"@id": "#dataset_collection-10"}
    ],
    "result": [
        {"@id": "datasets/Select_first_on_data_48_49.txt"}
    ],
    ...
}

with #dataset_collection-11 pointing to #collection1-param via exampleOfWork, and similar for the other collection. Individual input files should not have an exampleOfWork (they participate as members of their collections). The intermediate merged collection should not be in the crate at all.
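
The linking rule can be checked mechanically. The sketch below assumes a flattened graph held as a list of dicts, single-valued `exampleOfWork` references, and hypothetical identifiers `workflow.ga` and `#run` for the workflow and the action. It returns the action objects that do not point to any workflow input:

```python
def unlinked_objects(graph, workflow_id, action_id):
    """List action objects whose exampleOfWork is not a workflow input."""
    by_id = {e["@id"]: e for e in graph}
    inputs = {ref["@id"] for ref in by_id[workflow_id].get("input", [])}
    missing = []
    for ref in by_id[action_id].get("object", []):
        entity = by_id.get(ref["@id"], {})
        target = entity.get("exampleOfWork", {}).get("@id")
        if target not in inputs:
            missing.append(ref["@id"])
    return missing


graph = [
    {"@id": "workflow.ga", "input": [{"@id": "#collection1-param"}]},
    {"@id": "#run", "object": [
        {"@id": "#dataset_collection-11"},
        {"@id": "datasets/hello_33.txt"},
    ]},
    {"@id": "#dataset_collection-11", "@type": "Collection",
     "exampleOfWork": {"@id": "#collection1-param"}},
    {"@id": "datasets/hello_33.txt", "@type": "File"},
]
problems = unlinked_objects(graph, "workflow.ga", "#run")
```

Here `problems` flags `datasets/hello_33.txt`: an individual file listed directly as an action object, which is the situation described above.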

That was the main thing. I've also found two minor issues:

I have pushed here the full expected metadata file with all the changes.

pauldg commented 1 year ago

Thanks Simone, that clears up some of the details of the format. I'll continue with it.

pauldg commented 1 year ago

I've made further updates to the code addressing the required changes. Unfortunately the diff with the previous version is a bit difficult to make since I moved around some parts.

A few things to note:

simleo commented 1 year ago

The intermediary collection and the concatenated collection are both defined in the workflow as an output and thus they are listed as outputs.

OK. If they are workflow outputs for Galaxy, they have to be listed as workflow outputs in the RO-Crate as well.

Regarding the "advanced" parameter, if it's not enabled as a workflow parameter then it should not be included in the RO-Crate. However, I have some comments regarding its representation, which would become relevant in those cases where such parameters would have to be included. In the current version of the example, the PropertyValue is:

{
    "@id": "#advanced-pv",
    "@type": "PropertyValue",
    "exampleOfWork": {"@id": "#advanced-param"},
    "name": "merge collections tool PropertyValue",
    "value": {
        "conflict": {
            "__current_case__": 0,
            "duplicate_options": "suffix_conflict",
            "suffix_pattern": "_#"
        }
    }
}

I.e., the value has been inserted as JSON and merged with the overall JSON structure, making the RO-Crate invalid. In the previous version, instead, the value was inserted as a string, which is OK:

    "value": "{\"conflict\": {\"__current_case__\": 0, \"duplicate_options\": \"suffix_conflict\", \"suffix_pattern\": \"_#\"}}"
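
A small sketch of producing the string form with Python's standard `json` module, assuming the generator holds the parameter state as a dict. The variable names are illustrative:

```python
import json

# Parameter state as a nested structure, copied from the example above.
state = {
    "conflict": {
        "__current_case__": 0,
        "duplicate_options": "suffix_conflict",
        "suffix_pattern": "_#",
    }
}

property_value = {
    "@id": "#advanced-pv",
    "@type": "PropertyValue",
    "exampleOfWork": {"@id": "#advanced-param"},
    "name": "merge collections tool PropertyValue",
    # Serialize to a string so the value stays a plain literal in the crate.
    "value": json.dumps(state),
}
```

A consumer that needs the structure back can recover it with `json.loads(property_value["value"])`.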

Also, I think that "advanced" refers to the whole set of "hidden" parameters in the Galaxy interface, and there could be more than one. So the parameter should actually be called "conflict", leading to something like:

{
    "@id": "#conflict-pv",
    "@type": "PropertyValue",
    "exampleOfWork": {"@id": "#conflict-param"},
    "name": "conflict",
    "value": "{\"__current_case__\": 0, \"duplicate_options\": \"suffix_conflict\", \"suffix_pattern\": \"_#\"}"
},
{
    "@id": "#conflict-param",
    "@type": "FormalParameter",
    "additionalType": "Text",
    "name": "conflict",
    "valueRequired": "False"
},

But, again, in this specific case the parameter should not be included.

Here's a list of issues I've found in the current version of the example:

I've pushed the expected metadata file according to the above changes here.

pauldg commented 1 year ago

I agree with all changes, the only one I have doubts about is this:

  • There's a duplicate reference to datasets/hello_33.txt in #dataset_collection-13

This is intended since the two input collections reference the same input dataset and the input datasets use the filename as the id:

{
    "@id": "#dataset_collection-11",
    "@type": "Collection",
    "hasPart": [
        {
            "@id": "datasets/hello_33.txt"
        },
        {
            "@id": "datasets/world_34.txt"
        }
    ]
},

and

{
    "@id": "#dataset_collection-10",
    "@type": "Collection",
    "hasPart": [
        {
            "@id": "datasets/hello_33.txt"
        },
        {
            "@id": "datasets/universe_31.txt"
        }
    ]
},

simleo commented 1 year ago

This is intended since the two input collections reference the same input dataset and the input datasets use the filename as the id

Yes, but when you merge the collections only one of the datasets with a repeated name is included in the merged collection. This is explained here (in the "Merge collections" subsection). The same holds from the RO-Crate metadata file's standpoint: duplicate entries do not make sense because, although multiple values are represented as JSON lists, their JSON-LD semantics is basically that of sets.
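
A generator could drop repeated references before writing `hasPart`. This is only a sketch with a hypothetical helper name, `dedupe_refs`; it keeps the first occurrence of each `@id` and preserves order:

```python
def dedupe_refs(refs):
    """Remove repeated {"@id": ...} references, keeping the first of each."""
    seen = set()
    unique = []
    for ref in refs:
        if ref["@id"] not in seen:
            seen.add(ref["@id"])
            unique.append(ref)
    return unique


merged = dedupe_refs([
    {"@id": "datasets/hello_33.txt"},
    {"@id": "datasets/world_34.txt"},
    {"@id": "datasets/hello_33.txt"},
    {"@id": "datasets/universe_31.txt"},
])
```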

pauldg commented 1 year ago

Updated the code base to address the required changes.

About the merged collections, in the example workflow the collections are merged using the advanced parameter that handles conflicts (see the screenshot below). So there are two references to one dataset but the two elements of the collection do receive a unique "element identifier". Should this be expressed somehow in the ro-crate metadata?

[Screenshot not preserved]

simleo commented 1 year ago

About the merged collections, in the example workflow the collections are merged using the advanced parameter that handles conflicts (see the screenshot below). So there are two references to one dataset but the two elements of the collection do receive a unique "element identifier". Should this be expressed somehow in the ro-crate metadata?

Since the selected conflict handler appends suffixes to conflicted element identifiers, the same can be done in the RO-Crate: the generator can add two copies of datasets/hello_33.txt to the crate, named datasets/hello_33_1.txt and datasets/hello_33_2.txt, then the hasPart of the merged collection can be:

"hasPart": [
    {
        "@id": "datasets/hello_33_1.txt"
    },
    {
        "@id": "datasets/world_34.txt"
    },
    {
        "@id": "datasets/hello_33_2.txt"
    },
    {
        "@id": "datasets/universe_31.txt"
    }
]
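
The renaming could be sketched as follows. This is not Galaxy's own implementation: the helper name `suffix_conflicts` is hypothetical, and it simply numbers every path that occurs more than once, in order of appearance:

```python
from collections import Counter
from pathlib import PurePosixPath


def suffix_conflicts(paths):
    """Append _1, _2, ... to the stem of each path that appears repeatedly."""
    totals = Counter(paths)
    seen = Counter()
    renamed = []
    for p in paths:
        if totals[p] > 1:
            seen[p] += 1
            path = PurePosixPath(p)
            new_name = f"{path.stem}_{seen[p]}{path.suffix}"
            renamed.append(str(path.with_name(new_name)))
        else:
            renamed.append(p)
    return renamed


members = suffix_conflicts([
    "datasets/hello_33.txt",
    "datasets/world_34.txt",
    "datasets/hello_33.txt",
    "datasets/universe_31.txt",
])
has_part = [{"@id": m} for m in members]
```

The generator would still need to write the two copies of the file under the new names so that each `@id` resolves to a file in the crate.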

simleo commented 1 year ago

An alternative is to slightly change the example workflow to use the default conflict resolution, which keeps only one copy.

pauldg commented 1 year ago

An alternative is to slightly change the example workflow to use the default conflict resolution, which keeps only one copy.

I've changed the conflict resolution parameter for merging collections to the default, keep first. Some of the elements in ro-crate metadata have been reordered and renamed.

simleo commented 1 year ago

Looks good! Merging. For the Zenodo upload, please use a zip file: if you do that, Zenodo recognizes the format and generates a summary of the contained files to display on the record's page. If you can, use the .crate.zip extension, so it's compatible with WorkflowHub. Note that the zip needs to directly contain the contents of the RO-Crate, so that ro-crate-metadata.json is at the top level.
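
A minimal sketch of building such an archive with Python's standard `zipfile` module. The helper name `zip_crate` and the paths in the usage comment are hypothetical; the point is that archive names are made relative to the crate directory, so no wrapping folder ends up in the zip:

```python
import zipfile
from pathlib import Path


def zip_crate(crate_dir, zip_path):
    """Zip the contents of crate_dir so its files sit at the archive root."""
    crate_dir = Path(crate_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(crate_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the crate directory itself.
                zf.write(path, path.relative_to(crate_dir).as_posix())
    return zip_path


# Usage (hypothetical paths; keep the zip outside the crate directory):
# zip_crate("my_run_crate", "my_run.crate.zip")
```

Listing the result with `zipfile.ZipFile(...).namelist()` should show `ro-crate-metadata.json` without any leading directory.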