datalad / datalad-remake


Implement datalad data source/sink #10

Open mih opened 6 months ago

mih commented 6 months ago

This is taking the basic idea from https://github.com/datalad/datalad-remake/issues/9 and refining it a bit more.

The concept of having some kind of data source that feeds a computational workflow and a "sink" that accepts outcomes for storage is common. For example, Nipype supports a whole range of them: https://nipype.readthedocs.io/en/latest/api/generated/nipype.interfaces.io.html

In order to make datalad play well with workflow orchestrators (and descriptions), it would be useful to implement two new components that can be used to implement a data source and a sink (separately).

Source

This is a command that takes an identification of source data and provisions the referenced data in a particular way. Relevant scenarios could be:

Importantly, the output of a provisioned data source need not be a fully initialized checkout of a datalad dataset. It is perfectly in scope to generate just a bunch of files that are subjected to a different, workflow-internal transport mechanism (think distributing compute jobs on a cluster without a shared file system).
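
To make the idea more concrete, here is a minimal sketch of what such a provisioner could look like, built on the existing datalad Python API (`datalad.api.clone()`/`get()`). The function name and its parameters are purely illustrative, not a proposed interface.

```python
# Illustrative sketch only: provision a version-pinned subset of a dataset
# into a plain directory that a workflow can consume without knowing
# anything about datalad.
import tempfile

import datalad.api as dl


def provision_source(dataset_url, version, paths):
    """Clone `dataset_url`, check out `version`, and obtain `paths`.

    Returns the directory that holds the provisioned files.
    """
    workdir = tempfile.mkdtemp(prefix="datalad-source-")
    ds = dl.clone(source=dataset_url, path=workdir)
    # pin the requested dataset version (simplified; no subdataset handling)
    ds.repo.call_git(["checkout", version])
    # retrieve file content for the requested paths only
    ds.get(path=paths)
    return workdir
```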

According to https://www.commonwl.org/v1.2/CommandLineTool.html#Output_binding it should be possible to generate a detailed output list for a CWL-compliant implementation to pick up, verify, and use for feeding subsequent processing steps. The parameterization of the data source tool should allow for a meaningful level of detail (including named arguments?).
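
As an illustration of that mechanism, a data source tool could write a `cwl.output.json` file that a CWL runner loads as the tool's output object. Everything below (the field values, the `provisioned_files` output name) is a sketch, not a defined interface.

```python
# Sketch: report provisioned files to a CWL runner via cwl.output.json
# (see the CWL v1.2 "Output binding" section linked above).
import hashlib
import json
import os


def write_cwl_output(filepaths, outdir="."):
    files = []
    for path in filepaths:
        with open(path, "rb") as f:
            digest = hashlib.sha1(f.read()).hexdigest()
        files.append({
            "class": "File",
            "path": os.path.abspath(path),
            "checksum": f"sha1${digest}",
            "size": os.path.getsize(path),
        })
    # "provisioned_files" is a made-up output parameter name
    with open(os.path.join(outdir, "cwl.output.json"), "w") as f:
        json.dump({"provisioned_files": files}, f, indent=2)
```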

Sink

The purpose of a sink would be to (re)inject workflow outputs into a dataset. Again, different scenarios can be relevant:

We may need a way to declare a specific output file/dir name that is different from the name the workflow output natively has.

It would be instrumental if not only workflow outputs, but also workflow execution provenance, could be "sink'ed".
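
A minimal sketch of such a sink, assuming a hypothetical `sink()` helper on top of the existing `datalad.api` (only `Dataset.save()` is real API); the rename map and the provenance side-car file are just illustrations of the two points above.

```python
# Sketch: move (and optionally rename) workflow outputs into a dataset,
# drop a provenance record next to them, and save everything.
import json
import os
import shutil

import datalad.api as dl


def sink(dataset_path, outputs, provenance=None, message="workflow outputs"):
    """`outputs` maps workflow-native file names to target names in the dataset."""
    ds = dl.Dataset(dataset_path)
    to_save = []
    for native_name, target_name in outputs.items():
        dest = os.path.join(dataset_path, target_name)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.move(native_name, dest)
        to_save.append(dest)
    if provenance is not None:
        prov_file = os.path.join(dataset_path, "provenance.json")
        with open(prov_file, "w") as f:
            json.dump(provenance, f, indent=2)
        to_save.append(prov_file)
    ds.save(path=to_save, message=message)
```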

Impact

Having proper implementations of these components has the potential to make large parts (if not all) of the custom implementations of the https://github.com/psychoinformatics-de/fairly-big-processing-workflow obsolete. Rather than having dedicated adaptors for individual batch systems, a standard workflow/submission generator could be fed, with data sources/sinks being just "nodes" that are bound to the same execution environment as the main compute step(s) -- possibly automatically replicated for any number of compute nodes.

Relevance for remake special remote

Source and sink could also be the low-level tooling for the implementation of this special remote. We would know which workflow to run to (re)compute a key, we could generate a data source step, and we could point a sink to the location where git-annex expects the key to appear. The actual computation could then be performed by any CWL-compliant implementation. Importantly, computations would not have to depend on datalad-based data sources, or on workflows that are somehow special because datalad captured or provides them. They would be able to work with any workflow from any source.
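
A hedged sketch of how this could be wired up as a git-annex special remote, using the `annexremote` library; `lookup_compute_spec()`, `provision_source()`, and `run_workflow()` are placeholders for functionality described above, not existing code.

```python
# Sketch: a special remote that satisfies a key request by provisioning
# inputs, running the recorded workflow with any CWL-compliant runner,
# and sinking the one requested key to where git-annex expects it.
import shutil

from annexremote import Master, RemoteError, SpecialRemote


def lookup_compute_spec(key):
    """Placeholder: map a git-annex key to a recorded compute specification."""
    raise NotImplementedError


def provision_source(dataset, version, inputs):
    """Placeholder: the data source provisioner sketched above."""
    raise NotImplementedError


def run_workflow(workflow, srcdir):
    """Placeholder: hand off to a CWL-compliant runner, return output paths."""
    raise NotImplementedError


class RemakeRemote(SpecialRemote):
    def initremote(self):
        pass

    def prepare(self):
        pass

    def transfer_store(self, key, filename):
        raise RemoteError("this remote only recomputes content")

    def transfer_retrieve(self, key, filename):
        spec = lookup_compute_spec(key)
        srcdir = provision_source(spec["dataset"], spec["version"], spec["inputs"])
        outputs = run_workflow(spec["workflow"], srcdir)
        # sink exactly the requested key to the file git-annex asked for
        shutil.copy(outputs[spec["output_name"]], filename)

    def checkpresent(self, key):
        # content is "present" whenever it can be recomputed
        return lookup_compute_spec(key) is not None

    def remove(self, key):
        pass


if __name__ == "__main__":
    master = Master()
    master.LinkRemote(RemakeRemote(master))
    master.Listen()
```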

It should be possible for a special-remote-based computation to work like this:

For the last step to be sufficient and conclusive, it needs to have a sink parameterization that produces the one requested key.

If a single workflow execution also produces other keys that have been requested (a special remote would not know this, given how the special remote protocol currently works), they can be harvested fairly efficiently by caching the (intermediate) workflow execution environments and rerunning them with updated data sinks. Caching would be relatively simple, because all input parameters (including versions) are fully defined, so we can tell exactly when a workflow is re-executed in an identical fashion -- and I assume any efficient CWL implementation makes such decisions too.
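
For illustration, a cache key for "identical re-execution" could be derived from the fully pinned specification roughly like this (names and structure are made up):

```python
# Sketch: detect identical re-executions by hashing the fully pinned
# workflow specification (workflow identity, input versions, parameters).
import hashlib
import json


def execution_cache_key(workflow_id, input_versions, parameters):
    spec = {
        "workflow": workflow_id,
        "inputs": sorted(input_versions.items()),
        "parameters": sorted(parameters.items()),
    }
    canonical = json.dumps(spec, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```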