datalad / datalad-remake

Design option: wrap datalad inside CWL #9

Closed mih closed 5 months ago

mih commented 5 months ago

The original thinking was to create a new API layer around a new run-record with a new executioner (special remote). It is worth taking a step back and reevaluating that encapsulation now that CWL has entered the picture. Rationale: if we adopt CWL, we might as well make it maximally useful, rather than just an internal tool.

A main attraction of CWL is that it is its own ecosystem, and connecting to it rather than reimplementing it is good. Having compute instructions defined as CWL "steps" that could be linked into larger workflows and executed (outside datalad) via standard batch systems would be great. In such a scenario, we would need to make sure that datalad's versioning precision and data-provisioning capabilities remain available.

One way to achieve this would be a dedicated provisioning workflow step. It would use a dedicated tool to create a suitable working environment for a subsequent payload computation step. This datalad tool could obtain/checkout/pre-populate a dataset from any supported source/identifier, and then hand over to the next step, in which standard CWL input types like File make sense and are sufficient.
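A minimal sketch of what such a provisioning step could look like as a CWL `CommandLineTool`. The `datalad-provision` executable and its options are purely hypothetical placeholders for the datalad tool described above, not an existing command:

```yaml
# Hypothetical CWL tool description for a datalad-based provisioning step.
# "datalad-provision" and its --source/--version options are assumptions.
cwlVersion: v1.2
class: CommandLineTool
baseCommand: [datalad-provision]
inputs:
  dataset_source:
    type: string          # any supported source/identifier
    inputBinding:
      prefix: --source
  version:
    type: string          # exact dataset version to check out
    inputBinding:
      prefix: --version
outputs:
  worktree:
    type: Directory       # handed to the payload step as a standard CWL type
    outputBinding:
      glob: provisioned
```

A payload step downstream would then consume `worktree` (or individual `File` members of it) without needing any datalad awareness itself.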

This also has the advantage that the provisioning provenance would be captured automatically.

It would also be a "loose" CWL integration: using any of these tools inside or outside CWL makes sense, and is possible without making either system aware of the other.

There is also no need for an exclusive datalad tool as the provisioning solution. A series of git-clone/annex-init/annex-get commands would work just as well.
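The plain-git variant can be sketched as a CWL tool that wraps the command sequence via `ShellCommandRequirement`; the exact clone/checkout/get sequence shown is illustrative, assuming a git-annex dataset reachable at a plain URL:

```yaml
# Sketch: provisioning without any datalad tool, using only git/git-annex.
# The command sequence is an assumption about a typical minimal setup.
cwlVersion: v1.2
class: CommandLineTool
requirements:
  ShellCommandRequirement: {}
inputs:
  url: string             # clone source of the dataset repository
  commit: string          # exact version to provision
arguments:
  - shellQuote: false
    valueFrom: >
      git clone $(inputs.url) ds &&
      cd ds &&
      git checkout $(inputs.commit) &&
      git annex init &&
      git annex get .
outputs:
  dataset:
    type: Directory
    outputBinding:
      glob: ds
```

This keeps the provisioning step fully standard-tooling-based, at the cost of the convenience (and source/identifier abstraction) a dedicated tool would offer.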

A remake special remote could then make smart decisions. It could

mih commented 5 months ago

Closing. Continued in https://github.com/datalad/datalad-remake/issues/14