ResearchObject / workflow-run-crate

Workflow Run RO-Crate profile
https://www.researchobject.org/workflow-run-crate/
Apache License 2.0

CQ11 - Parameter connections #25

Closed simleo closed 1 year ago

simleo commented 1 year ago

Knowing how workflow parameters were passed to individual tools is important to find out how they affected the outputs.

We are currently linking workflow and tool parameters with connectedTo from the source tool / workflow to the target tool / workflow. For instance, in revsort:

[Figure: graph of the revsort workflow]

we currently have:


{
    "@id": "packed.cwl#revtool.cwl",
    "@type": "SoftwareApplication",
    "input": [
        {"@id": "packed.cwl#revtool.cwl/input"}
    ],
    "output": [
        {"@id": "packed.cwl#revtool.cwl/output"}
    ]
},
{
    "@id": "packed.cwl#sorttool.cwl",
    "@type": "SoftwareApplication",
    "input": [
        {"@id": "packed.cwl#sorttool.cwl/reverse"},
        {"@id": "packed.cwl#sorttool.cwl/input"}
    ],
    "output": [
        {"@id": "packed.cwl#sorttool.cwl/output"}
    ]
},
{
    "@id": "packed.cwl#revtool.cwl/output",
    "@type": "FormalParameter",
    "connectedTo": {"@id": "packed.cwl#sorttool.cwl/input"}
}

but that's inaccurate, since such links only exist within the revsort workflow. packed.cwl#revtool.cwl and packed.cwl#sorttool.cwl represent standalone software tools that happen to be connected this way in revsort, but might be used differently in another workflow.

simleo commented 1 year ago

We need something like:

{
    "@id": "packed.cwl#main/sorted",
    "@type": "HowToStep",
    "position": "1",
    "workExample": {"@id": "packed.cwl#sorttool.cwl"},
    "parameterConnections": [
        {"@id": "#pc1"},
        ...
    ]
},
{
    "@id": "#pc1",
    "@type": "ParameterConnection",
    "source": {"@id": "packed.cwl#revtool.cwl/output"},
    "target": {"@id": "packed.cwl#sorttool.cwl/input"}
}

simleo commented 1 year ago

See proposal in https://github.com/ResearchObject/ro-terms/pull/12. I changed the property's name from parameterConnections to connections and its domain from HowToStep to ComputationalWorkflow because:

simleo commented 1 year ago

Also changed source to sourceParameter and target to targetParameter: they're more specific and there is no clash with http://schema.org/target.
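
For illustration only, here is a minimal Python sketch of how a consumer might resolve these entities into (source, target) pairs. It assumes the property names discussed above (connections, sourceParameter, targetParameter); the workflow id "packed.cwl", the simplified "@type" and the helper name resolve_connections are hypothetical and not taken from the proposal.

```python
def resolve_connections(graph, workflow_id):
    """Return (source_id, target_id) pairs for the connections of one workflow."""
    by_id = {entity["@id"]: entity for entity in graph}
    refs = by_id[workflow_id].get("connections", [])
    if isinstance(refs, dict):  # JSON-LD allows a single reference instead of a list
        refs = [refs]
    pairs = []
    for ref in refs:
        connection = by_id[ref["@id"]]
        pairs.append((
            connection["sourceParameter"]["@id"],
            connection["targetParameter"]["@id"],
        ))
    return pairs


# Hypothetical, heavily trimmed "@graph" holding one workflow and one connection.
GRAPH = [
    {
        "@id": "packed.cwl",  # assumed workflow id
        "@type": "ComputationalWorkflow",
        "connections": [{"@id": "#pc1"}],
    },
    {
        "@id": "#pc1",
        "@type": "ParameterConnection",
        "sourceParameter": {"@id": "packed.cwl#revtool.cwl/output"},
        "targetParameter": {"@id": "packed.cwl#sorttool.cwl/input"},
    },
]

print(resolve_connections(GRAPH, "packed.cwl"))
```

Keeping the connection as a separate entity means the tool descriptions themselves stay free of workflow-specific links.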

simleo commented 1 year ago

The problem with https://github.com/ResearchObject/ro-terms/pull/12 (implemented in #29) is that all links to parameter connections are in the workflow. While this makes sense, it may not be enough to derive the actual data flow, especially when a tool is reused in different steps.

As an example, consider this workflow, which returns a sorted list of top-level domains given a list of hostnames as input. It uses a CWL equivalent of the classic rev | cut -f 1 | rev trick to work around cut's inability to select the last field:

[Figure: graph of the top-level domains workflow]

The connections for this workflow are:

{
    "@id": "#pc1",
    "@type": "ParameterConnection",
    "sourceParameter": {"@id": "packed.cwl#revtool.cwl/rev_out"},
    "targetParameter": {"@id": "packed.cwl#cuttool.cwl/cut_in"}
},
{
    "@id": "#pc2",
    "@type": "ParameterConnection",
    "sourceParameter": {"@id": "packed.cwl#main/hostnames"},
    "targetParameter": {"@id": "packed.cwl#revtool.cwl/rev_in"}
},
{
    "@id": "#pc3",
    "@type": "ParameterConnection",
    "sourceParameter": {"@id": "packed.cwl#cuttool.cwl/cut_out"},
    "targetParameter": {"@id": "packed.cwl#revtool.cwl/rev_in"}
},
{
    "@id": "#pc4",
    "@type": "ParameterConnection",
    "sourceParameter": {"@id": "packed.cwl#main/reverse_sort"},
    "targetParameter": {"@id": "packed.cwl#sorttool.cwl/reverse"}
},
{
    "@id": "#pc5",
    "@type": "ParameterConnection",
    "sourceParameter": {"@id": "packed.cwl#revtool.cwl/rev_out"},
    "targetParameter": {"@id": "packed.cwl#sorttool.cwl/sort_in"}
},
{
    "@id": "#pc6",
    "@type": "ParameterConnection",
    "sourceParameter": {"@id": "packed.cwl#sorttool.cwl/sort_out"},
    "targetParameter": {"@id": "packed.cwl#main/tlds"}
}

Suppose a consumer tries to build the workflow's diagram from this information. Since order is not guaranteed, the connections might be processed as #pc1, #pc5, #pc2, #pc3, #pc4, #pc6. Processing #pc1 and #pc5 first links the same revtool-executing step to both cuttool and sorttool. Only when processing #pc3 does the consumer realize that there must be another revtool-executing step, since connecting to the existing one would create a cycle. The resulting diagram is:

[Figure: graph with the bad connection]

which is a different workflow that computes an entirely different output.
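
The ambiguity can be reproduced with a short Python sketch. It reduces each connection in the listing above to the tools that own its parameters, which is all a consumer can do without step information, and then checks the resulting tool-level graph for a cycle. The helper names (owner, tool_edges, has_cycle) are hypothetical, and treating "packed.cwl#main" as the owner of the workflow-level parameters is an assumption based on the ids above.

```python
# Connections copied from the listing above: id -> (sourceParameter, targetParameter).
CONNECTIONS = {
    "#pc1": ("packed.cwl#revtool.cwl/rev_out", "packed.cwl#cuttool.cwl/cut_in"),
    "#pc2": ("packed.cwl#main/hostnames", "packed.cwl#revtool.cwl/rev_in"),
    "#pc3": ("packed.cwl#cuttool.cwl/cut_out", "packed.cwl#revtool.cwl/rev_in"),
    "#pc4": ("packed.cwl#main/reverse_sort", "packed.cwl#sorttool.cwl/reverse"),
    "#pc5": ("packed.cwl#revtool.cwl/rev_out", "packed.cwl#sorttool.cwl/sort_in"),
    "#pc6": ("packed.cwl#sorttool.cwl/sort_out", "packed.cwl#main/tlds"),
}

WORKFLOW = "packed.cwl#main"  # assumed owner of the workflow-level parameters


def owner(param_id):
    """Drop the parameter name: '...#revtool.cwl/rev_in' -> '...#revtool.cwl'."""
    return param_id.rsplit("/", 1)[0]


def tool_edges(connections):
    """Tool-to-tool edges; connections touching workflow parameters are skipped."""
    edges = set()
    for source, target in connections.values():
        src, dst = owner(source), owner(target)
        if WORKFLOW not in (src, dst):
            edges.add((src, dst))
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in a directed graph given as (src, dst) pairs."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # reached a node that is still on the current path
        visiting.add(node)
        found = any(visit(nxt) for nxt in adjacency.get(node, ()))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in adjacency)


EDGES = tool_edges(CONNECTIONS)
print(sorted(EDGES))
print(has_cycle(EDGES))
```

With one node per tool, revtool feeds cuttool and cuttool feeds revtool, so the graph is cyclic. A consumer has to guess how to split revtool into two steps, and nothing in the connections says which split is the right one.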

To avoid this problem, we should add connection to the relevant HowToStep instances. Note that we need to retain the ability to place them in ComputationalWorkflow as well, since some languages (e.g. CWL) allow passthrough links with no steps involved:

[Figure: graph of a workflow with passthrough links]
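
As a rough sketch of what step-level links could look like for the top-level domains example, the Python snippet below attaches connection references to HowToStep entities. Everything specific here is an assumption made for illustration: the step ids (rev1, cut, rev2, sort), the use of "connection" as the property name, and the convention that a step lists the connections whose target parameter belongs to the tool it runs. Connections that target a workflow output (#pc6) are left out.

```python
# Hypothetical step entities; ids and the "connection" property are assumptions.
STEPS = [
    {
        "@id": "packed.cwl#main/rev1",
        "@type": "HowToStep",
        "workExample": {"@id": "packed.cwl#revtool.cwl"},
        "connection": [{"@id": "#pc2"}],  # main/hostnames -> rev_in
    },
    {
        "@id": "packed.cwl#main/cut",
        "@type": "HowToStep",
        "workExample": {"@id": "packed.cwl#cuttool.cwl"},
        "connection": [{"@id": "#pc1"}],  # rev_out -> cut_in
    },
    {
        "@id": "packed.cwl#main/rev2",
        "@type": "HowToStep",
        "workExample": {"@id": "packed.cwl#revtool.cwl"},
        "connection": [{"@id": "#pc3"}],  # cut_out -> rev_in
    },
    {
        "@id": "packed.cwl#main/sort",
        "@type": "HowToStep",
        "workExample": {"@id": "packed.cwl#sorttool.cwl"},
        "connection": [{"@id": "#pc4"}, {"@id": "#pc5"}],
    },
]


def steps_running(steps, tool_id):
    """Ids of the steps that execute the given tool, in listing order."""
    return [s["@id"] for s in steps if s["workExample"]["@id"] == tool_id]


def incoming_by_step(steps):
    """Map each step id to the ids of the connections attached to it."""
    return {
        s["@id"]: [ref["@id"] for ref in s.get("connection", [])]
        for s in steps
    }


INCOMING = incoming_by_step(STEPS)
for step_id in steps_running(STEPS, "packed.cwl#revtool.cwl"):
    print(step_id, INCOMING[step_id])
```

Under these assumptions the two revtool steps are distinct entities, each with its own incoming connection, so #pc2 and #pc3 can no longer be merged onto a single node.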