datalad / datalad-remake


Historic (prov-record) vs up-to-date compute instructions #2

Open mih opened 6 months ago

mih commented 6 months ago

There are two principal use cases for a compute instruction record

The current implementation of run/rerun in datalad tries to ignore that these are two different things, and documents in a format that aims to be re-executable. However, this has problems:

We need to come up with a specification that supports both cases equally well. This possibly means:

CWL-based solution

With https://github.com/datalad/datalad-remake/issues/10 we would factor a specification into three components

A "historic" record needs to capture all three verbatim. This would be easy, because they would appear in the form of a modular CWL workflow with three steps, represented by three CWL sub-workflows (or rather command line tool invocations): data provisioning, compute, data capture. Together they become part of the commit that captures the outcome (just like a traditional run record, but more modular).
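As a rough illustration of what such a modular record could look like, here is a minimal sketch that assembles a three-step CWL workflow document as a plain Python dict and serializes it to JSON (CWL documents can be written as JSON as well as YAML). The function name `build_historic_workflow` and the tool file names `provision.cwl`, `compute.cwl`, and `capture.cwl` are hypothetical placeholders, not part of datalad-remake. The wiring of step inputs and outputs is left empty, so this shows the shape of the record only and has not been validated with a CWL runner.

```python
import json


def build_historic_workflow(provision_tool, compute_tool, capture_tool):
    """Assemble a three-step CWL workflow document as a plain dict.

    The three arguments are references to CWL tool descriptions
    (hypothetical file names in this sketch).
    """
    return {
        "cwlVersion": "v1.2",
        "class": "Workflow",
        # Inputs/outputs and the step wiring are omitted in this sketch.
        # In real CWL, execution order follows from data dependencies
        # between steps, not from the order in which they are listed.
        "inputs": {},
        "outputs": {},
        "steps": {
            "provision": {"run": provision_tool, "in": {}, "out": []},
            "compute": {"run": compute_tool, "in": {}, "out": []},
            "capture": {"run": capture_tool, "in": {}, "out": []},
        },
    }


workflow = build_historic_workflow(
    "provision.cwl", "compute.cwl", "capture.cwl"
)

# Serialize deterministically, so that the same record always yields the
# same text when it is stored alongside the commit.
record_text = json.dumps(workflow, indent=2, sort_keys=True)
```

Keeping the three steps as separate entries is what would later allow one of them to be swapped out without touching the other two.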

Theoretically each of the three components can be "fixed up", and would need to be whenever the API of an underlying tool changes.

Importantly, (1) and (3) will be more likely to change (they would need to track datalad evolution and run on the client system natively), while (2) could be implemented in some more "static" fashion via container-based execution.

When rerunning historic records it should be possible to provide updated workflow step definitions. It may be meaningful to employ an updatable approach from the beginning. Something like:

Now, when a workflow is written out (by rerun, or by the special remote handler) we can also apply updates to the data provisioning and capture steps (possibly replacing them entirely, informed by previous configuration). However, we would not write such updates back to the dataset, but instead maintain the pristine original record. Upgrades are applied on the fly, each time.
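The "upgrade on the fly, never write back" idea can be sketched in a few lines. This is an illustration under assumptions, not the datalad-remake implementation: the record is modeled as a dict with a `steps` mapping, and `upgraded_workflow` as well as all file names are hypothetical.

```python
import copy


def upgraded_workflow(record, step_updates):
    """Return a copy of `record` with selected step definitions replaced.

    `record` itself is never modified, so the pristine historic record
    stays exactly as it was committed.
    """
    workflow = copy.deepcopy(record)
    for step_name, new_run in step_updates.items():
        if step_name not in workflow["steps"]:
            raise KeyError(f"unknown workflow step: {step_name}")
        workflow["steps"][step_name]["run"] = new_run
    return workflow


# Hypothetical historic record, as it would be read from the dataset.
pristine = {
    "steps": {
        "provision": {"run": "provision-v1.cwl"},
        "compute": {"run": "compute.cwl"},
        "capture": {"run": "capture-v1.cwl"},
    }
}

# Only the provisioning and capture steps are upgraded; the compute
# step is kept verbatim.
updated = upgraded_workflow(
    pristine,
    {"provision": "provision-v2.cwl", "capture": "capture-v2.cwl"},
)
```

Because the function works on a deep copy, it can be called on every rerun with whatever updates are current at that time, while the stored record remains the historic one.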