There are two principal use cases for a compute instruction record:
- provenance record: how something was computed. Here, the primary aim is to document.
- specify how something can be computed
The current implementation of `run`/`rerun` in datalad tries to ignore that these are two different things, and documents in a format that also aims to be re-executable. However, this has problems:
- when a prior record no longer works because the necessary environment is no longer available, the instructions have to be updated. We then need a new record, although apart from the record itself nothing changes in the dataset or its individual keys that could mark them as now being provided via different means.
- updating the existing record nevertheless means rewriting history
We need to come up with a specification that supports both use cases equally well. This possibly means:
- establishing a dedicated prov-record (document-only) to be used in commits for the purpose of recording outcomes of `datalad run`
- developing the concept of compute instructions as a (library of) templates for executing code with prov-capture
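
The split between the two items above can be sketched as two separate data structures. This is only an illustration under assumptions: the names `ProvRecord`, `ComputeInstruction`, and `execute_with_prov_capture` are hypothetical and not part of datalad, and the use of `string.Template` for parameter substitution is a choice made for this sketch.

```python
import shlex
import subprocess
import sys
from dataclasses import dataclass, field
from string import Template


@dataclass(frozen=True)
class ProvRecord:
    """Document-only record (hypothetical): describes what ran, is never executed."""
    command: str                 # the fully rendered command line
    template_name: str           # which compute instruction produced it
    parameters: dict = field(default_factory=dict)
    exit_code: int = 0


@dataclass(frozen=True)
class ComputeInstruction:
    """Reusable template (hypothetical): specifies how something can be computed."""
    name: str
    command_template: str        # string.Template syntax, e.g. "$python -c $code"

    def render(self, **parameters: str) -> str:
        # Raises KeyError if a placeholder is left without a value.
        return Template(self.command_template).substitute(**parameters)


def execute_with_prov_capture(instruction: ComputeInstruction, **parameters: str) -> ProvRecord:
    """Run the rendered command and return a document-only record of the run."""
    command = instruction.render(**parameters)
    result = subprocess.run(shlex.split(command), capture_output=True, text=True)
    return ProvRecord(
        command=command,
        template_name=instruction.name,
        parameters=dict(parameters),
        exit_code=result.returncode,
    )


# Usage: the template lives in a library, the record goes into the commit.
instruction = ComputeInstruction(name="python-oneliner", command_template="$python -c $code")
record = execute_with_prov_capture(
    instruction,
    python=shlex.quote(sys.executable),
    code=shlex.quote("print(6*7)"),
)
```

In this arrangement, fixing a broken template changes only the library entry; the `ProvRecord` stored with the outcome stays untouched.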

CWL-based solution

With https://github.com/datalad/datalad-remake/issues/10 we would factor a specification into three components: (1) data provisioning, (2) compute, and (3) data capture.
A "historic" record needs to capture all three verbatim. This would be easy, because they would appear in the form of a modular CWL workflow with three steps, represented by three CWL sub-workflows (or rather command line tool invocations): data provisioning, compute, data capture. Together they become part of the commit that captures the outcome (just like a traditional run record, but more modular).
Theoretically, each of the three components can be "fixed up", and would need that whenever the API of an underlying tool changes.
Importantly, (1) and (3) are more likely to change (they need to track datalad evolution and run natively on the client system), while (2) could be implemented in a more "static" fashion via container-based execution.
When rerunning historic records, it should be possible to provide updated workflow step definitions. It may be meaningful to employ an updatable approach from the beginning. Something like:
- The three-step workflow is read and upgraded to the most recent CWL version, which is written to a temporary location
- Only then CWL runs
Now, when a workflow is written out (by `rerun`, or by the special remote handler), we can also apply updates to the data provisioning and capture steps (possibly replacing them entirely, informed by the previous configuration). However, we would not write such updates back to the dataset, but instead maintain the pristine original record. Upgrades are applied on the fly, each time.
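
A minimal sketch of the "upgrade on the fly, never write back" idea follows. It rests on several assumptions: the record is a stripped-down dictionary loosely shaped like a CWL workflow (not a complete, valid CWL document), the step names and file names are invented, and setting `cwlVersion` is only a placeholder for a real CWL upgrade. The helpers `upgrade_record` and `write_for_execution` are hypothetical.

```python
import copy
import json
import tempfile
from pathlib import Path

# Hypothetical pristine record as it would be stored with the commit.
HISTORIC_RECORD = {
    "cwlVersion": "v1.0",
    "class": "Workflow",
    "steps": {
        "provision": {"run": "provision-v1.cwl"},  # (1) data provisioning
        "compute": {"run": "compute.cwl"},         # (2) compute
        "capture": {"run": "capture-v1.cwl"},      # (3) data capture
    },
}


def upgrade_record(record: dict, target_cwl_version: str, step_updates: dict) -> dict:
    """Return an upgraded copy of the record; the original is left unmodified."""
    upgraded = copy.deepcopy(record)
    # Placeholder for a real CWL version upgrade, which would rewrite more than this field.
    upgraded["cwlVersion"] = target_cwl_version
    for step_name, definition in step_updates.items():
        if step_name not in upgraded["steps"]:
            raise KeyError(f"unknown workflow step: {step_name}")
        upgraded["steps"][step_name] = definition
    return upgraded


def write_for_execution(record: dict, target_cwl_version: str,
                        step_updates: dict, workdir: str) -> Path:
    """Write the upgraded workflow to a temporary location, ready to be run."""
    upgraded = upgrade_record(record, target_cwl_version, step_updates)
    path = Path(workdir) / "workflow.cwl.json"
    path.write_text(json.dumps(upgraded, indent=2))
    return path


# Usage: upgrade at write-out time; the stored record stays as it was.
with tempfile.TemporaryDirectory() as tmp:
    workflow_path = write_for_execution(
        HISTORIC_RECORD,
        target_cwl_version="v1.2",
        step_updates={"provision": {"run": "provision-v2.cwl"}},
        workdir=tmp,
    )
    written = json.loads(workflow_path.read_text())
    # Only at this point would a CWL runner be invoked on workflow_path.
```

Because the upgrade operates on a deep copy, repeating it on every rerun is cheap and leaves the committed record byte-identical.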