datalad / datalad-remake

Define specification for compute instructions #5

Open mih opened 4 months ago

mih commented 4 months ago

This can be thought of as the next iteration on the datalad run record format. This established format uses one commit/record to capture one computation that can produce any number of annex keys.

A primary objective here is to design a specification that supports computing any number of annex keys individually, without requiring one commit/record per key (think: datasets with a large number of files that can each be computed individually, in some structured fashion).

The key to this is likely going to be a parameterizable instruction set. @mih added basic support for this to the run machinery in https://github.com/datalad/datalad/pull/6424; see http://docs.datalad.org/en/stable/design/provenance_capture.html#placeholders-in-commands-and-io-specifications
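To make the idea concrete, here is a minimal sketch of one instruction template being expanded with a separate parameter set per output file. It relies only on Python's `str.format`-style placeholders. The template text, the placeholder names (`size`, `input`, `output`), and the helpers `required_parameters` and `instantiate` are all hypothetical and not part of any existing datalad API.

```python
from string import Formatter

# Hypothetical instruction template; the placeholder names are illustrative only.
TEMPLATE = "convert --resize {size} {input} {output}"

# Hypothetical per-key parameter sets, keyed by the output path they produce.
PARAMETER_SETS = {
    "thumbs/a.png": {"size": "64x64", "input": "raw/a.png"},
    "thumbs/b.png": {"size": "128x128", "input": "raw/b.png"},
}


def required_parameters(template):
    """Return the set of placeholder names used in a template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def instantiate(template, output, parameters):
    """Expand a template into the concrete command for a single output."""
    values = dict(parameters, output=output)
    missing = required_parameters(template) - set(values)
    if missing:
        raise ValueError(f"missing parameters: {sorted(missing)}")
    return template.format(**values)


# One template, many outputs: no per-output record of the command itself.
commands = {
    output: instantiate(TEMPLATE, output, parameters)
    for output, parameters in PARAMETER_SETS.items()
}
```

In this sketch only the small parameter dictionaries differ between outputs, which is the property that would avoid one commit/record per key.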

If this is the path, a specification needs to consider two components:

Instruction template

The closest established concept in datalad is a side-car run-record (see http://docs.datalad.org/en/stable/design/provenance_capture.html#the-provenance-record). However, this format needs a revision. A few pointers for candidate developments follow.

Moreover, the side-car record is using a content-based filename. Here we need to identify the instruction template somehow, but we also want to be able to edit/fix an instruction template without having to fix all references to it. See https://github.com/psychoinformatics-de/datalad-remake/issues/2
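The tension between the two identification schemes can be illustrated with a small sketch. It assumes a content-based identifier is a hash of the serialized record (SHA-256 is an arbitrary choice here, not necessarily what datalad uses), and contrasts that with a stable, user-chosen name. `content_id` and `TemplateRegistry` are hypothetical names for illustration only.

```python
import hashlib
import json


def content_id(record):
    """Content-based identifier: hash of a canonical JSON serialization.

    The hash function is an assumption made for this sketch.
    """
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TemplateRegistry:
    """Hypothetical store that maps a stable name to an instruction template."""

    def __init__(self):
        self._by_name = {}

    def put(self, name, record):
        self._by_name[name] = record

    def get(self, name):
        return self._by_name[name]


original = {"cmd": "cp {input} {output}"}
fixed = {"cmd": "cp -- {input} {output}"}

# Outputs refer to the template by its stable name, not by its content hash.
references = {"out/a.txt": "copy", "out/b.txt": "copy"}

registry = TemplateRegistry()
registry.put("copy", original)
registry.put("copy", fixed)  # fixing the template leaves all references untouched
```

With content-based naming, the edit from `original` to `fixed` yields a different identifier, so every reference would have to be rewritten. With the name-based indirection, the references stay valid, at the cost of no longer pinning the exact template content.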

It would make sense to use developments from https://concepts.datalad.org in a revision of the run-record format. Rather than leaving parameters completely implicitly defined, we can offer users the ability to record the semantics of parameters in the fashion of Property in the https://concepts.datalad.org/s/thing/unreleased/ schema.

But see https://github.com/psychoinformatics-de/datalad-remake/issues/1 for a readily available specification (and see CWL section below).

(Per annex key) Parameter set

Here we need to find a format and place to store parameters. See https://github.com/psychoinformatics-de/datalad-remake/issues/4 for a dedicated issue.

CWL-based solution

A fully defined compute instruction is a two-step CWL workflow linked to the necessary inputs. An input declaration can be linked to a workflow definition to form a single, joint record (see cp.inputs.yaml in https://github.com/datalad/datalad-remake/issues/7#issue-2274771335 for an example).

The inputs are the specification of the working environment needed to perform a computation (i.e., the parameters to remake-provision https://github.com/datalad/datalad-remake/issues/12), plus any parameters of the actual computation (non-file arguments, association of provisioned files to workflow arguments).

In order to get a complete record for producing a single key, we need a declaration that identifies the key in the workflow output, based on some workflow output values (e.g. output dir plus relpath etc.). This is not (necessarily) related to https://github.com/datalad/datalad-remake/issues/13, because in a special remote implementation we do not need to capture such an output in a dataset, but only serve it to a temporary location given by git-annex.
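A rough sketch of what such a declaration could look like follows. It assumes the workflow reports its output directory under some named output value, and that the declaration consists of that name plus a relative path. The dictionary layout and the helpers `locate_output` and `deliver` are hypothetical. The copy to `destination` stands in for serving the file to the temporary location given by git-annex.

```python
import shutil
from pathlib import Path


def locate_output(workflow_outputs, declaration):
    """Resolve the file for one key from workflow output values.

    `declaration` names the workflow output that holds the output directory
    ("base") and the path of the wanted file relative to it ("relpath").
    """
    base = Path(workflow_outputs[declaration["base"]]).resolve()
    target = (base / declaration["relpath"]).resolve()
    if base not in target.parents:
        raise ValueError("relpath escapes the declared output directory")
    return target


def deliver(workflow_outputs, declaration, destination):
    """Copy the located output to the destination path requested by the caller."""
    source = locate_output(workflow_outputs, declaration)
    if not source.is_file():
        raise FileNotFoundError(source)
    shutil.copyfile(source, destination)
    return Path(destination)
```

Nothing in this sketch touches a dataset: the output is located in the workflow's working area and copied out, which matches the special remote use case described above.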