In order to make datalad play well with workflow orchestrators (and workflow descriptions), it would be useful to implement two new components that can be used as a data source and a sink (separately).
Source
This is a command that takes a source data identification and provisions the referenced data in a particular way. Relevant scenarios could be:
Clone a dataset from a URL, and provision a worktree at a given commit
Same as above, and also obtain a selected subset of file content
Provide a set of annexkeys, each with a custom filename in a directory
Obtain an annex repository and check out a custom metadata-driven view
Importantly, the output of a provisioned data source need not be a fully initialized checkout of a datalad dataset. It is perfectly in scope to generate just a bunch of files that are subjected to a different, workflow-internal transport mechanism (think distributing compute jobs on a cluster without a shared file system).
According to https://www.commonwl.org/v1.2/CommandLineTool.html#Output_binding, it should be possible to generate a detailed output list that a CWL-compliant implementation can pick up, verify, and use to feed subsequent processing steps. The parameterization of the data source tool should allow for a meaningful level of detail (including named arguments?).
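A minimal sketch of what such a source step could look like, assuming a hypothetical provisioning helper built on the datalad Python API (the helper name, argument layout, and output structure are illustrative only); writing a cwl.output.json file is the standard CWL mechanism for a CommandLineTool to report its outputs explicitly:

```python
# Hypothetical sketch only: clone a dataset at a given commit, obtain a
# selected subset of file content, and report the provisioned files via
# cwl.output.json so a CWL runner can verify and forward them.
import json
import sys
from pathlib import Path

from datalad.api import clone


def provision(url, commit, paths, workdir="inputs"):
    ds = clone(source=url, path=workdir)
    # check out the requested version (plain git call through the repo object)
    ds.repo.call_git(["checkout", commit])
    # obtain only the selected subset of file content
    ds.get(path=paths)
    # declare the provisioned files so the CWL runner can pick them up
    outputs = {
        "provisioned": [
            {"class": "File", "path": str(Path(workdir) / p)} for p in paths
        ]
    }
    Path("cwl.output.json").write_text(json.dumps(outputs, indent=2))


if __name__ == "__main__":
    # e.g.: provision.py <url> <commit> <path> [<path> ...]
    provision(sys.argv[1], sys.argv[2], sys.argv[3:])
```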
Sink
The purpose of a sink would be to (re)inject workflow outputs into a dataset. Again, different scenarios can be relevant:
Modify a given checkout of a repository
Also save/commit the changes (to a different given branch)
Also push to a configured/configurable remote (may need a lockfile as an optional input to support execution in distributed/concurrent workflows)
We may need a way to declare a specific output file/dir name that is different from the name the workflow output natively has.
It would be instrumental if not only workflow outputs but also workflow execution provenance could be "sink'ed".
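A correspondingly rough sketch of a sink, assuming it is handed an existing checkout to modify; the branch, remote, lockfile, and rename parameters are illustrative stand-ins for the scenarios above:

```python
# Hypothetical sketch only: inject workflow outputs into an existing dataset
# checkout, save them (optionally on a dedicated branch), and push under a
# lock to support concurrent jobs on a shared file system.
import fcntl
import shutil
from pathlib import Path

from datalad.api import Dataset


def sink(results, dataset_path, branch=None, remote=None, lockfile=None,
         rename=None):
    ds = Dataset(dataset_path)
    if branch:
        # simplified: create and switch to a per-job results branch
        ds.repo.call_git(["checkout", "-b", branch])
    for src in results:
        # optionally store under a name different from the workflow-native one
        dest = Path(dataset_path) / (rename or {}).get(src, Path(src).name)
        shutil.copy2(src, dest)
    ds.save(message="Add workflow outputs")
    if remote:
        if lockfile:
            # serialize pushes from concurrent jobs
            with open(lockfile, "w") as lf:
                fcntl.flock(lf, fcntl.LOCK_EX)
                ds.push(to=remote)
        else:
            ds.push(to=remote)
```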
Impact
Having a proper implementation of these components has the potential to make large parts (if not all) of the custom implementations of the https://github.com/psychoinformatics-de/fairly-big-processing-workflow obsolete. This would mean that rather than maintaining dedicated adaptors for individual batch systems, a standard workflow/submission generator could be fed, where data sources/sinks are just "nodes" that are bound to the same execution environment as the main compute step(s) -- possibly automatically replicated for any number of compute nodes.
Relevance for remake special remote
Source and sink could also be the low-level tooling for the implementation of this special remote. We would know what workflow to run to (re)compute a key, we could generate a data source step, and we could point a sink to the location where git-annex expects the key to appear. The actual computation could then be performed by any CWL-compliant implementation. Importantly, computations would not have to depend on datalad-based data sources, or on somehow special, datalad-captured/provided workflows. They would be able to work with any workflow from any source.
It should be possible for a special-remote-based computation to work like this:
look up the instructions (A) for the requested key
look up the workflow specification (B) based on the name declared in (A) and the version of the dataset (if given)
put (A) and (B) in a working directory and execute them via a CWL implementation
For the last step to be sufficient and conclusive, (A) needs to contain a sink parameterization that produces the one requested key.
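A sketch of that retrieval path, built on the annexremote helper library for Python git-annex special remotes; lookup_instructions()/lookup_workflow() are hypothetical helpers, the sink_target job parameter is a made-up convention, and cwltool merely stands in for any CWL-compliant runner:

```python
# Sketch only: the key-retrieval path of such a special remote.
import json
import subprocess
import tempfile
from pathlib import Path

from annexremote import RemoteError, SpecialRemote


def lookup_instructions(key):
    """Hypothetical: fetch the compute instructions (A) recorded for `key`."""
    raise NotImplementedError


def lookup_workflow(name):
    """Hypothetical: fetch the workflow specification (B) by `name`."""
    raise NotImplementedError


class RemakeRemote(SpecialRemote):
    def initremote(self):
        pass

    def prepare(self):
        pass

    def checkpresent(self, key):
        # a key counts as "present" if instructions to recompute it exist
        try:
            return lookup_instructions(key) is not None
        except NotImplementedError:
            return False

    def transfer_retrieve(self, key, filename):
        instructions = lookup_instructions(key)           # (A)
        workflow = lookup_workflow(instructions["name"])  # (B)
        with tempfile.TemporaryDirectory() as workdir:
            Path(workdir, "workflow.cwl").write_text(workflow)
            # point the sink parameterization in (A) at the location where
            # git-annex expects the key content to appear
            job = dict(instructions["parameters"], sink_target=filename)
            Path(workdir, "job.json").write_text(json.dumps(job))
            res = subprocess.run(
                ["cwltool", "--outdir", workdir, "workflow.cwl", "job.json"],
                cwd=workdir,
            )
            if res.returncode != 0:
                raise RemoteError(f"recompute of {key} failed")

    def transfer_store(self, key, filename):
        raise RemoteError("content can only be recomputed, not stored")

    def remove(self, key):
        raise RemoteError("recomputable content cannot be removed")
```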
If one workflow execution produces additional keys that were also requested (a special remote would not know, due to the way the special remote protocol works (right now)), they can be harvested somewhat efficiently by caching the (intermediate) workflow execution environments and rerunning them with updated data sinks. Caching would be relatively simple, because all input parameters (including versions) are fully defined, so we can tell exactly when a workflow is re-executed in an identical fashion -- and I assume any efficient CWL implementation makes such decisions too.
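For illustration, such a cache key could simply be a digest over the fully defined inputs (the function and argument names below are hypothetical):

```python
# Illustration only: because (A) and (B) fully define a (re)execution, a
# deterministic cache key can be computed from them, and an identical
# re-execution can be detected before running anything.
import hashlib
import json


def execution_cache_key(instructions: dict, workflow: str) -> str:
    payload = json.dumps(
        {"instructions": instructions, "workflow": workflow},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```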
This is taking the basic idea from https://github.com/datalad/datalad-remake/issues/9 and refining it a bit more.
The concept of having some kind of data source that feeds a computational workflow and a "sink" that accepts outcomes for storage is common. For example, NiPyPE supports a whole range of them: https://nipype.readthedocs.io/en/latest/api/generated/nipype.interfaces.io.html