Closed simleo closed 1 year ago
We need something like:
{
"@id": "packed.cwl#main/sorted",
"@type": "HowToStep",
"position": "1",
"workExample": {"@id": "packed.cwl#sorttool.cwl"},
"parameterConnections": [
{"@id": "#pc1"},
...
]
},
{
"@id": "#pc1",
"@type": "ParameterConnection",
"source": {"@id": "packed.cwl#revtool.cwl/output"},
"target": {"@id": "packed.cwl#sorttool.cwl/input"}
}
See proposal in https://github.com/ResearchObject/ro-terms/pull/12. I changed the property's name from parameterConnections to connections and its domain from HowToStep to ComputationalWorkflow because:
HowToStep
Also changed source to sourceParameter and target to targetParameter: they're more specific and there is no clash with http://schema.org/target.
The problem with https://github.com/ResearchObject/ro-terms/pull/12 (implemented in #29) is that all links to parameter connections are in the workflow. While this makes sense, it may not be enough to derive the actual data flow, especially when a tool is reused in different steps.
As an example, consider this workflow for getting a sorted list of top level domains given a list of hostnames as input. It uses a CWL equivalent of the classic rev | cut -f 1 | rev trick to work around cut's inability to select the last field:
The connections for this workflow are:
{
"@id": "#pc1",
"@type": "ParameterConnection",
"sourceParameter": {"@id": "packed.cwl#revtool.cwl/rev_out"},
"targetParameter": {"@id": "packed.cwl#cuttool.cwl/cut_in"}
},
{
"@id": "#pc2",
"@type": "ParameterConnection",
"sourceParameter": {"@id": "packed.cwl#main/hostnames"},
"targetParameter": {"@id": "packed.cwl#revtool.cwl/rev_in"}
},
{
"@id": "#pc3",
"@type": "ParameterConnection",
"sourceParameter": {"@id": "packed.cwl#cuttool.cwl/cut_out"},
"targetParameter": {"@id": "packed.cwl#revtool.cwl/rev_in"}
},
{
"@id": "#pc4",
"@type": "ParameterConnection",
"sourceParameter": {"@id": "packed.cwl#main/reverse_sort"},
"targetParameter": {"@id": "packed.cwl#sorttool.cwl/reverse"}
},
{
"@id": "#pc5",
"@type": "ParameterConnection",
"sourceParameter": {"@id": "packed.cwl#revtool.cwl/rev_out"},
"targetParameter": {"@id": "packed.cwl#sorttool.cwl/sort_in"}
},
{
"@id": "#pc6",
"@type": "ParameterConnection",
"sourceParameter": {"@id": "packed.cwl#sorttool.cwl/sort_out"},
"targetParameter": {"@id": "packed.cwl#main/tlds"}
}
Suppose a consumer tries to build the workflow's diagram with this information. Since order is not guaranteed, connections might be processed as #pc1, #pc5, #pc2, #pc3, #pc4, #pc6. This leads to the same revtool-executing step being linked to both cuttool and sorttool. Only when processing #pc3 does the consumer realize that there must be another revtool-executing step, since connecting to the existing one would lead to a cycle. The resulting diagram is:
which is a different workflow that computes an entirely different output.
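To make the ambiguity concrete, here is a minimal Python sketch of a naive consumer that only sees the six connections listed above. The helper names (owner, tool_graph, has_cycle) are made up for illustration and are not part of any RO-Crate library. It collapses every parameter to the tool that owns it, which is all the information the workflow-level connections provide, and then checks the resulting tool graph for a cycle.

```python
from collections import defaultdict

# Connections copied from the example above, as (sourceParameter, targetParameter).
CONNECTIONS = {
    "#pc1": ("packed.cwl#revtool.cwl/rev_out", "packed.cwl#cuttool.cwl/cut_in"),
    "#pc2": ("packed.cwl#main/hostnames", "packed.cwl#revtool.cwl/rev_in"),
    "#pc3": ("packed.cwl#cuttool.cwl/cut_out", "packed.cwl#revtool.cwl/rev_in"),
    "#pc4": ("packed.cwl#main/reverse_sort", "packed.cwl#sorttool.cwl/reverse"),
    "#pc5": ("packed.cwl#revtool.cwl/rev_out", "packed.cwl#sorttool.cwl/sort_in"),
    "#pc6": ("packed.cwl#sorttool.cwl/sort_out", "packed.cwl#main/tlds"),
}


def owner(param_id):
    """Return the tool or workflow a parameter belongs to (text before the last '/')."""
    return param_id.rsplit("/", 1)[0]


def tool_graph(connections):
    """Naive reconstruction: one node per tool, one edge per connection."""
    graph = defaultdict(set)
    for source, target in connections.values():
        graph[owner(source)].add(owner(target))
    return graph


def has_cycle(graph, ignore=()):
    """Depth-first search for a cycle, skipping the nodes listed in `ignore`."""
    visiting, done = set(), set()

    def visit(node):
        if node in done or node in ignore:
            return False
        if node in visiting:
            return True  # reached a node that is still on the current path
        visiting.add(node)
        found = any(visit(nxt) for nxt in graph.get(node, ()))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in list(graph))


graph = tool_graph(CONNECTIONS)
# The workflow itself is both the source of inputs and the sink of outputs,
# so it is left out of the cycle check.
print(has_cycle(graph, ignore={"packed.cwl#main"}))
```

The cycle revtool -> cuttool -> revtool is the signal that one tool node stands for more than one step, but the connections alone do not say which of the two revtool steps feeds cuttool and which feeds sorttool.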
To avoid this problem, we should add connection to the relevant HowToStep instances. Note that we need to retain the ability to place them in ComputationalWorkflow as well, since some languages (e.g. CWL) allow passthrough links with no steps involved:
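Going back to the rev / cut / rev example, the sketch below shows how step-level connections could remove the ambiguity. It is an illustration under stated assumptions, not the agreed design: the step ids (packed.cwl#main/rev1 and so on) are hypothetical, and so is the choice to have each step list every connection it takes part in, whether as source or as target. A connection that no step lists is treated as a workflow-level passthrough.

```python
# Connections copied from the example above, as (sourceParameter, targetParameter).
CONNECTIONS = {
    "#pc1": ("packed.cwl#revtool.cwl/rev_out", "packed.cwl#cuttool.cwl/cut_in"),
    "#pc2": ("packed.cwl#main/hostnames", "packed.cwl#revtool.cwl/rev_in"),
    "#pc3": ("packed.cwl#cuttool.cwl/cut_out", "packed.cwl#revtool.cwl/rev_in"),
    "#pc4": ("packed.cwl#main/reverse_sort", "packed.cwl#sorttool.cwl/reverse"),
    "#pc5": ("packed.cwl#revtool.cwl/rev_out", "packed.cwl#sorttool.cwl/sort_in"),
    "#pc6": ("packed.cwl#sorttool.cwl/sort_out", "packed.cwl#main/tlds"),
}

# Hypothetical HowToStep entities, reduced to the two properties used here.
# The step ids and the "list every connection the step takes part in"
# convention are assumptions made for this sketch.
STEPS = {
    "packed.cwl#main/rev1": {"workExample": "packed.cwl#revtool.cwl", "connection": ["#pc2", "#pc1"]},
    "packed.cwl#main/cut": {"workExample": "packed.cwl#cuttool.cwl", "connection": ["#pc1", "#pc3"]},
    "packed.cwl#main/rev2": {"workExample": "packed.cwl#revtool.cwl", "connection": ["#pc3", "#pc5"]},
    "packed.cwl#main/sort": {"workExample": "packed.cwl#sorttool.cwl", "connection": ["#pc4", "#pc5", "#pc6"]},
}


def owner(param_id):
    """Return the tool or workflow a parameter belongs to (text before the last '/')."""
    return param_id.rsplit("/", 1)[0]


def step_edges(steps, connections):
    """Resolve each connection to a (source_step, target_step) pair.

    None stands for the workflow itself, so (None, None) is a passthrough.
    """
    edges = set()
    for pc_id, (source, target) in connections.items():
        src = tgt = None
        for step_id, step in steps.items():
            if pc_id not in step["connection"]:
                continue
            if step["workExample"] == owner(source):
                src = step_id
            if step["workExample"] == owner(target):
                tgt = step_id
        edges.add((src, tgt))
    return edges


for edge in sorted(step_edges(STEPS, CONNECTIONS), key=str):
    print(edge)
```

With this layout the two revtool executions resolve to distinct nodes regardless of the order in which the connections are processed.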
Knowing how workflow parameters were passed to individual tools is important to find out how they affected the outputs.
We are currently linking workflow and tool parameters with connectedTo from the source tool / workflow to the target tool / workflow. For instance, in revsort:
we currently have:
but that's inaccurate, since such links only exist within the revsort workflow. packed.cwl#revtool.cwl and packed.cwl#sorttool.cwl represent standalone software tools that happen to be connected this way in revsort, but might be used differently in another workflow.