Open mih opened 6 months ago
Maybe worth an addendum: apart from --since, the closest thing to workflow management with the current datalad rerun is the procedure described in the Handbook's subsection 5.1.4.2 (DVC comparison): execute a series of datalad run commands, tag important steps, and re-run a range of commits (optionally creating a new branch) with datalad rerun --branch foo start-tag..end-tag.
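The procedure above can be sketched as a small helper that only assembles the command lines and prints them, without executing anything. The helper name `tagged_workflow`, the step messages, and the script paths are hypothetical. Placing the start tag before the first step assumes the range follows Git's `A..B` semantics, where `A` itself is excluded.

```python
import shlex


def tagged_workflow(steps, start_tag, end_tag, branch=None):
    """Return argv lists for: tag start, run each step, tag end, rerun the range.

    `steps` is a list of (commit message, command string) pairs.
    Nothing is executed here; the caller decides what to do with the commands.
    """
    # Tag the commit *before* the first step (assumed: A..B excludes A).
    commands = [["git", "tag", start_tag]]
    for message, cmd in steps:
        commands.append(["datalad", "run", "-m", message, cmd])
    commands.append(["git", "tag", end_tag])

    rerun = ["datalad", "rerun"]
    if branch is not None:
        # Optionally replay the range onto a new branch.
        rerun += ["--branch", branch]
    rerun.append(f"{start_tag}..{end_tag}")
    commands.append(rerun)
    return commands


# Hypothetical two-step workflow; the script paths are placeholders.
steps = [
    ("convert raw data", "python code/convert.py"),
    ("run analysis", "python code/analyze.py"),
]
for argv in tagged_workflow(steps, "start-tag", "end-tag", branch="foo"):
    print(shlex.join(argv))
```

The printed lines can be pasted into a shell inside a dataset, which keeps the sketch independent of whether datalad is installed where it is run.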
Status quo
datalad-run (as shipped with datalad v1) executes an opaque, single-step workflow that is defined by a sequence of strings that are ultimately given to Python's subprocess for execution. The command supports placeholder expansion for the strings with a few commonly defined variables, and any number of custom definitions that are evaluated at runtime.
As an execution precondition, datalad-run requires a checkout of a Git repository to run. The --input parameter is used to guarantee the presence of select annex'ed files. The --output parameter is used to ensure that particular files can be written to.
From a data sink perspective, a checkout of a Git repository is again required. Typically, all workflow outputs in the working directory are committed to that worktree as a new commit. A partial capture can be achieved via a combination of --output and --explicit.
Possible re-envisioning
Running a future datalad-run could
- [...] HEAD of the worktree.
- if --input is given, the respective items are added to the specification
- if --output is given, those are made writable
- [...] datalad-run --explicit --assume-ready both <cmd> to get the placeholder expansion
The actual execution could also take place in a temporary clone/worktree without an issue.
A datalad rerun would be very similar. It would keep the workflow discovery (--since), and either execute sequentially on top of HEAD, or on a different branch (i.e., adjusted data sink).
Why blow up a perfectly simple implementation with a complex CWL dependency?
That is not necessary. What the above sketch boils down to is a refactoring. Rather than having one monolithic run, we factor out provisioning and output capture from the execution. Whether we run the two respective commands directly or via CWL does not matter much. However, the resulting new helpers would also become available in a CWL context (and for remake), thereby increasing usage.
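To make the proposed split concrete, here is a minimal sketch of the three phases as separate functions. It is not datalad code: `provision`, `execute`, and `capture` are hypothetical names, provisioning is reduced to an existence check, capture is reduced to listing which declared outputs exist (no commit is made), and only indexed placeholders such as `{inputs[0]}` are handled.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def provision(workdir: Path, inputs: list) -> None:
    """Stand-in for provisioning: verify that every declared input is present."""
    missing = [p for p in inputs if not (workdir / p).exists()]
    if missing:
        raise FileNotFoundError(f"inputs not available: {missing}")


def execute(workdir: Path, cmd: list, inputs: list, outputs: list) -> list:
    """Expand placeholders in each command string, then hand off to subprocess."""
    values = {"inputs": inputs, "outputs": outputs}
    argv = [part.format(**values) for part in cmd]
    subprocess.run(argv, cwd=workdir, check=True)
    return argv


def capture(workdir: Path, outputs: list) -> list:
    """Stand-in for output capture: report the declared outputs that exist."""
    return [p for p in outputs if (workdir / p).exists()]


def demo() -> list:
    """Run all three phases in a throwaway directory."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "in.txt").write_text("hello\n")
        inputs, outputs = ["in.txt"], ["out.txt"]
        # A tiny command: upper-case the input file into the output file.
        script = (
            "import sys, pathlib; "
            "pathlib.Path(sys.argv[2]).write_text("
            "pathlib.Path(sys.argv[1]).read_text().upper())"
        )
        cmd = [sys.executable, "-c", script, "{inputs[0]}", "{outputs[0]}"]
        provision(workdir, inputs)
        execute(workdir, cmd, inputs, outputs)
        return capture(workdir, outputs)


print(demo())
```

Because each phase is an ordinary function with no knowledge of the others, the same helpers could be called directly in sequence, as `demo` does, or wrapped as individual steps by an external workflow runner.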