Build demo mapping of a datalad-run record as a CWL CommandLineTool

Here is the target spec: https://www.commonwl.org/v1.2/CommandLineTool.html

The source is much simpler: http://docs.datalad.org/en/stable/design/provenance_capture.html#the-provenance-record

The challenge is that the datalad run record is a combination of three things that are recognized as separate entities in the CWL world:

workflow/command line tool specification
workflow inputs
workflow execution provenance
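
To make the three-way split concrete, here is a minimal sketch that sorts the fields of a run record into those three groups. The field names (cmd, inputs, outputs, dsid, exit, pwd) are the ones discussed in this issue; the grouping itself, the function name `split_run_record`, and the example values are assumptions made for illustration, not an agreed mapping.

```python
"""Group the fields of a datalad run record by their closest CWL counterpart."""

# Illustrative grouping (an assumption, not an established mapping).
GROUPS = {
    "tool_specification": ("cmd",),
    "job_inputs": ("inputs", "outputs"),
    "execution_provenance": ("dsid", "exit", "pwd"),
}


def split_run_record(record: dict) -> dict:
    """Return the record's fields, grouped by the entity they would map to."""
    return {
        name: {key: record[key] for key in keys if key in record}
        for name, keys in GROUPS.items()
    }


# Hypothetical record; all values are made up for the example.
record = {
    "cmd": "cp -v input.txt output.txt",
    "inputs": ["input.txt"],
    "outputs": ["output.txt"],
    "dsid": "00000000-0000-0000-0000-000000000000",
    "exit": 0,
    "pwd": ".",
}
parts = split_run_record(record)
```

Output paths are placed with the job inputs here because, in a tool like `cp`, the destination is passed as a parameter; the output values themselves would belong to the provenance side.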

Following the cwltool documentation, the first two can be linked to form a single execution specification:

positional arguments:
  cwl_document
          path or URL to a CWL Workflow, CommandLineTool, or ExpressionTool.
          If the `inputs_object` has a `cwl:tool` field indicating
          the path or URL to the cwl_document, then the `cwl_document`
          argument is optional.
  inputs_object
         path or URL to a YAML or JSON formatted description of the required input
         values for the given `cwl_document`.

Here is a demo of that:

cp.cwl.yaml

cwlVersion: v1.2
class: CommandLineTool
baseCommand: [cp, -v]
inputs:
  src:
    type: File
    inputBinding:
      position: 1
  dstpath:
    type: string
    inputBinding:
      position: 2
outputs:
  dst:
    type: File
    outputBinding:
      glob: $(inputs.dstpath)

cp.inputs.yaml

cwl:tool: cp.cwl.yaml # this is the key bit
src:
  class: File
  path: input.txt
dstpath: output.txt
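
Since the inputs object may be YAML or JSON, the same linked document could also be generated programmatically. The following is a rough sketch using only the standard library; the helper name `make_inputs_object` is hypothetical, and it assumes cwltool treats the `cwl:tool` field the same way in a JSON inputs object as in the YAML one above.

```python
import json


def make_inputs_object(tool_path: str, src: str, dstpath: str) -> dict:
    """Build an inputs object that points back at its tool via cwl:tool."""
    return {
        "cwl:tool": tool_path,  # link to the CommandLineTool document
        "src": {"class": "File", "path": src},
        "dstpath": dstpath,
    }


job = make_inputs_object("cp.cwl.yaml", "input.txt", "output.txt")
# Serialized form, suitable for writing to e.g. a cp.inputs.json file.
job_json = json.dumps(job, indent=2)
```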

The tool and inputs documents can then be executed as one instruction set:

❯ cwltool cp.inputs.yaml
INFO /usr/bin/cwltool 3.1.20240404144621
INFO Resolved 'cp.inputs.yaml' to 'file:///tmp/cwl/some/cp.inputs.yaml'
INFO [job cp.cwl.yaml] /tmp/pi5sa5fc$ cp \
    -v \
    /tmp/sbqjcgn3/stg4b4f504d-626f-43c2-92bc-fe2cca85ab43/input.txt \
    output.txt
'/tmp/sbqjcgn3/stg4b4f504d-626f-43c2-92bc-fe2cca85ab43/input.txt' -> 'output.txt'
INFO [job cp.cwl.yaml] completed success
{
    "dst": {
        "location": "file:///tmp/cwl/some/output.txt",
        "basename": "output.txt",
        "class": "File",
        "checksum": "sha1$b63c7c3a7543014bd34d99d31a85606d485837f9",
        "size": 7,
        "path": "/tmp/cwl/some/output.txt"
    }
}INFO Final process status is success
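
The output object carries a `checksum` of the form `sha1$<hex digest>` plus a `size`, which is enough to verify a file after the fact. A minimal sketch of such a check follows; the function names are hypothetical, and the self-check at the end uses an empty temporary file rather than the files from the demo.

```python
import hashlib
import os
import tempfile


def cwl_sha1(path: str) -> str:
    """Compute a checksum string in the `sha1$<hex>` form used in the output."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return "sha1$" + digest.hexdigest()


def matches_cwl_file(path: str, file_obj: dict) -> bool:
    """Check a local file against the checksum and size of a CWL File object."""
    return (
        file_obj.get("checksum") == cwl_sha1(path)
        and file_obj.get("size") == os.path.getsize(path)
    )


# Self-check on an empty temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    empty_path = tmp.name
empty_checksum = cwl_sha1(empty_path)
empty_ok = matches_cwl_file(empty_path, {"checksum": empty_checksum, "size": 0})
os.remove(empty_path)
```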

Now this can be taken a step further. With cwlprov (https://github.com/common-workflow-language/cwlprov) we get an instant PROV record, packaged as a BagIt:

❯ cwltool --provenance prov.out --enable-host-provenance cp.inputs.yaml
INFO /home/mih/env/datalad-dev/bin/cwltool 3.1.20240404144621
INFO [cwltool] /home/mih/env/datalad-dev/bin/cwltool --provenance prov.out --enable-host-provenance cp.inputs.yaml
INFO Resolved 'cp.inputs.yaml' to 'file:///tmp/cwl/some/cp.inputs.yaml'
INFO [provenance] Adding to RO file:///tmp/cwl/some/input.txt
INFO [job cp.cwl.yaml] /tmp/_5kqmfwq$ cp \
    -v \
    /tmp/b3ftolbq/stg52f39a5d-524a-46da-9bd5-4df4e513a05e/input.txt \
    output.txt
'/tmp/b3ftolbq/stg52f39a5d-524a-46da-9bd5-4df4e513a05e/input.txt' -> 'output.txt'
INFO [job cp.cwl.yaml] completed success
/home/mih/env/datalad-dev/lib/python3.11/site-packages/rdflib/plugins/serializers/nt.py:40: UserWarning: NTSerializer always uses UTF-8 encoding. Given encoding was: None
  warnings.warn(
{
    "dst": {
        "location": "file:///tmp/cwl/some/output.txt",
        "basename": "output.txt",
        "class": "File",
        "checksum": "sha1$b63c7c3a7543014bd34d99d31a85606d485837f9",
        "size": 7,
        "path": "/tmp/cwl/some/output.txt"
    }
}INFO Final process status is success
INFO [provenance] Finalizing Research Object
INFO [provenance] Research Object saved to /tmp/cwl/some/prov.out

This leads to:

❯ tree -a prov.out
prov.out
├── bag-info.txt
├── bagit.txt
├── data
│   ├── ad
│   │   └── adbe6c7d3c0d8b19ecd492bec9532c13a6e1c9ad
│   └── b6
│       └── b63c7c3a7543014bd34d99d31a85606d485837f9
├── manifest-sha1.txt
├── metadata
│   ├── logs
│   │   └── engine.a844e9af-9c50-4208-be9f-76db7579c11b.txt
│   ├── manifest.json
│   └── provenance
│       ├── primary.cwlprov.json
│       ├── primary.cwlprov.jsonld
│       ├── primary.cwlprov.nt
│       ├── primary.cwlprov.provn
│       ├── primary.cwlprov.ttl
│       └── primary.cwlprov.xml
├── snapshot
│   └── cp.cwl.yaml
├── tagmanifest-sha1.txt
├── tagmanifest-sha256.txt
├── tagmanifest-sha512.txt
└── workflow
    ├── packed.cwl
    ├── primary-job.json
    └── primary-output.json

9 directories, 20 files
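
Judging from the tree above, the payload files under `data/` appear to be stored by their SHA-1 digest, in a subdirectory named after the first two hex characters. Assuming that layout holds in general (it is inferred from this one listing, not taken from the cwlprov documentation), a checksum from the output object can be turned into a path inside the bag like this; `bag_data_path` is a hypothetical helper name.

```python
from pathlib import PurePosixPath


def bag_data_path(checksum: str) -> PurePosixPath:
    """Map a `sha1$<hex>` checksum to its assumed location inside the bag."""
    algo, _, hexdigest = checksum.partition("$")
    if algo != "sha1" or len(hexdigest) != 40:
        raise ValueError(f"unsupported checksum: {checksum!r}")
    # Assumed layout: data/<first two hex chars>/<full hex digest>
    return PurePosixPath("data") / hexdigest[:2] / hexdigest


path = bag_data_path("sha1$b63c7c3a7543014bd34d99d31a85606d485837f9")
```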

Does this provenance output have all the information from a datalad run record?

[x] cmd is comprehensively captured in the workflow declaration
[x] inputs are in the workflow inputs, in much more detail. They are also in prov.out/workflow/primary-job.json (careful with the absolute file:// URLs)
[x] outputs: see prov.out/workflow/primary-output.json
[x] dsid is absent; CWL has no concept of this. A related "associated with dataset" property can be defined easily. With https://github.com/datalad/datalad-remake/issues/12 the dsid could even become an explicit workflow parameter
[x] exit is not recorded verbatim, but CWL allows for labeling exit codes as success, temporary failure, or permanent failure. Although this reduces information, it is also more flexible (not every non-zero exit is a problem) and enables decision-making
[ ] pwd: in the prov output everything is recoded to match the organization of the BagIt, which includes its own data hashtree.
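
The exit-code labeling mentioned in the list can be sketched as follows. CWL's CommandLineTool has the fields `successCodes`, `temporaryFailCodes` and `permanentFailCodes`; the function below is only an illustration of how a raw exit code might be mapped to a label with them, not cwltool's actual implementation. Treating every unlisted code as a permanent failure is an assumption of this sketch.

```python
def classify_exit(
    code: int,
    success_codes=(0,),
    temporary_fail_codes=(),
    permanent_fail_codes=(),
) -> str:
    """Label a raw exit code as success, temporaryFailure or permanentFailure."""
    if code in success_codes:
        return "success"
    if code in temporary_fail_codes:
        return "temporaryFailure"
    # Codes listed in permanent_fail_codes and, by assumption, any code not
    # listed anywhere end up here.
    return "permanentFailure"


# Example: a tool where exit code 1 is not an error, and a made-up code 75
# is declared as a temporary failure worth retrying.
labels = [
    classify_exit(1, success_codes=(0, 1)),
    classify_exit(75, temporary_fail_codes=(75,)),
    classify_exit(2),
]
```

This is where the "decision-making" comes in: a caller can retry on a temporary failure and give up on a permanent one, which a bare exit number does not express.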

So not everything is readily available in the right format, but missing bits can be added easily.
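
One such format mismatch is the absolute file:// URL in the job and output records. A minimal sketch of turning it back into a path relative to a dataset root is shown below; the helper name and the choice of `/tmp/cwl/some` as the dataset root are assumptions for the example (that directory is simply where the demo was run).

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def relative_to_dataset(location: str, dataset_root: str) -> str:
    """Convert an absolute file:// URL into a path relative to dataset_root."""
    parts = urlsplit(location)
    if parts.scheme != "file":
        raise ValueError(f"not a file:// URL: {location!r}")
    path = PurePosixPath(unquote(parts.path))
    # Raises ValueError if the file lives outside the dataset root.
    return str(path.relative_to(dataset_root))


rel = relative_to_dataset("file:///tmp/cwl/some/output.txt", "/tmp/cwl/some")
```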

Going with the bagit as main/only output format seems unnecessarily complex. With a datalad dataset we can capture most/all info without taking apart the dataset worktree.

datalad/datalad-remake issue #7