datalad / datalad-remake


Build demo mapping of a datalad-run record as a CWL CommandLineTool #7

Closed (mih closed this issue 3 months ago)

mih commented 4 months ago

Here is the target spec: https://www.commonwl.org/v1.2/CommandLineTool.html

The source is much simpler: http://docs.datalad.org/en/stable/design/provenance_capture.html#the-provenance-record
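For orientation, here is a minimal sketch of pulling such a record out of a commit message. It assumes the record is a JSON object placed between the two marker lines used below and that it carries fields such as `cmd`, `inputs`, `outputs`, `pwd`, and `exit`; both points reflect my reading of the linked design document. The helper name `extract_run_record` and the sample message are hypothetical, and the sample omits fields such as the dataset ID.

```python
import json

# Marker lines assumed to delimit the run record in a commit message.
START = "=== Do not change lines below ==="
END = "^^^ Do not change lines above ^^^"


def extract_run_record(commit_message: str) -> dict:
    """Return the JSON run record embedded in a commit message (hypothetical helper)."""
    try:
        _, rest = commit_message.split(START, 1)
        payload, _ = rest.split(END, 1)
    except ValueError:
        raise ValueError("no run record markers found in commit message") from None
    return json.loads(payload)


# Illustrative commit message; the values are made up for this example.
message = """[DATALAD RUNCMD] cp -v input.txt output.txt

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "cp -v input.txt output.txt",
 "exit": 0,
 "extra_inputs": [],
 "inputs": ["input.txt"],
 "outputs": ["output.txt"],
 "pwd": "."
}
^^^ Do not change lines above ^^^
"""

record = extract_run_record(message)
print(record["cmd"], record["inputs"], record["outputs"])
```

The point of the sketch is only that the whole record is one flat JSON object, which is why the mapping to several CWL entities needs some thought.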

The challenge is that the datalad run record is a combination of three things that are recognized as separate entities in the CWL world:

Following the cwltool documentation, the first two can be linked to form a single execution specification:

positional arguments:
  cwl_document
          path or URL to a CWL Workflow, CommandLineTool, or ExpressionTool.
          If the `inputs_object` has a `cwl:tool` field indicating
          the path or URL to the cwl_document, then the `cwl_document`
          argument is optional.
  inputs_object
          path or URL to a YAML or JSON formatted description of the required input
          values for the given `cwl_document`.

Here is a demo of that:

cp.cwl.yaml

cwlVersion: v1.2
class: CommandLineTool
baseCommand: [cp, -v]
inputs:
  src:
    type: File
    inputBinding:
      position: 1
  dstpath:
    type: string
    inputBinding:
      position: 2
outputs:
  dst:
    type: File
    outputBinding:
      glob: $(inputs.dstpath)

cp.inputs.yaml

cwl:tool: cp.cwl.yaml # this is the key bit
src:
  class: File
  path: input.txt
dstpath: output.txt

With the `cwl:tool` link in place, both files can be executed as a single instruction set:

❯ cwltool cp.inputs.yaml
INFO /usr/bin/cwltool 3.1.20240404144621
INFO Resolved 'cp.inputs.yaml' to 'file:///tmp/cwl/some/cp.inputs.yaml'
INFO [job cp.cwl.yaml] /tmp/pi5sa5fc$ cp \
    -v \
    /tmp/sbqjcgn3/stg4b4f504d-626f-43c2-92bc-fe2cca85ab43/input.txt \
    output.txt
'/tmp/sbqjcgn3/stg4b4f504d-626f-43c2-92bc-fe2cca85ab43/input.txt' -> 'output.txt'
INFO [job cp.cwl.yaml] completed success
{
    "dst": {
        "location": "file:///tmp/cwl/some/output.txt",
        "basename": "output.txt",
        "class": "File",
        "checksum": "sha1$b63c7c3a7543014bd34d99d31a85606d485837f9",
        "size": 7,
        "path": "/tmp/cwl/some/output.txt"
    }
}INFO Final process status is success
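The two files in this demo were written by hand. As a rough sketch of how they might be derived from a run record, the following maps a simplified record (only `cmd`, `inputs`, and `outputs`) to a CommandLineTool and a linked inputs object. The function `run_record_to_cwl`, the generated parameter names (`in1`, `out2`), and the file name `cp.cwl.json` are hypothetical. The sketch assumes that every input and output path appears literally as a token in `cmd`, with no globs or placeholders, and it emits JSON on the assumption that a CWL runner accepts JSON documents as well as YAML.

```python
import json
import shlex


def run_record_to_cwl(record: dict, tool_path: str = "cp.cwl.json"):
    """Map a simplified run record to a (CommandLineTool, inputs object) pair."""
    in_paths = set(record["inputs"])
    out_paths = set(record["outputs"])

    base_command = []
    cwl_inputs, cwl_outputs = {}, {}
    job = {"cwl:tool": tool_path}  # links the inputs object to the tool
    position = 0
    for token in shlex.split(record["cmd"]):
        if token in in_paths:
            # A declared input becomes a File parameter.
            position += 1
            name = f"in{position}"
            cwl_inputs[name] = {"type": "File", "inputBinding": {"position": position}}
            job[name] = {"class": "File", "path": token}
        elif token in out_paths:
            # A declared output is passed as a string and collected via glob.
            position += 1
            name = f"out{position}"
            cwl_inputs[name] = {"type": "string", "inputBinding": {"position": position}}
            cwl_outputs[f"{name}_file"] = {
                "type": "File",
                "outputBinding": {"glob": f"$(inputs.{name})"},
            }
            job[name] = token
        elif position == 0:
            # Everything before the first path is treated as the base command.
            base_command.append(token)
        else:
            raise ValueError(f"unsupported token after first path: {token!r}")

    tool = {
        "cwlVersion": "v1.2",
        "class": "CommandLineTool",
        "baseCommand": base_command,
        "inputs": cwl_inputs,
        "outputs": cwl_outputs,
    }
    return tool, job


# Simplified, made-up record matching the cp demo.
record = {
    "cmd": "cp -v input.txt output.txt",
    "inputs": ["input.txt"],
    "outputs": ["output.txt"],
}
tool, job = run_record_to_cwl(record)
print(json.dumps(tool, indent=2))
print(json.dumps(job, indent=2))
```

Real run records can contain shell syntax, glob patterns, and placeholder substitutions in `cmd`, none of which this sketch handles; it only illustrates how one record splits into the two CWL documents.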

This can be taken a step further. With cwlprov (https://github.com/common-workflow-language/cwlprov), enabled through cwltool's --provenance option, we get an instant PROV record packaged as a BagIt bag:

❯ cwltool --provenance prov.out --enable-host-provenance cp.inputs.yaml
INFO /home/mih/env/datalad-dev/bin/cwltool 3.1.20240404144621
INFO [cwltool] /home/mih/env/datalad-dev/bin/cwltool --provenance prov.out --enable-host-provenance cp.inputs.yaml
INFO Resolved 'cp.inputs.yaml' to 'file:///tmp/cwl/some/cp.inputs.yaml'
INFO [provenance] Adding to RO file:///tmp/cwl/some/input.txt
INFO [job cp.cwl.yaml] /tmp/_5kqmfwq$ cp \
    -v \
    /tmp/b3ftolbq/stg52f39a5d-524a-46da-9bd5-4df4e513a05e/input.txt \
    output.txt
'/tmp/b3ftolbq/stg52f39a5d-524a-46da-9bd5-4df4e513a05e/input.txt' -> 'output.txt'
INFO [job cp.cwl.yaml] completed success
/home/mih/env/datalad-dev/lib/python3.11/site-packages/rdflib/plugins/serializers/nt.py:40: UserWarning: NTSerializer always uses UTF-8 encoding. Given encoding was: None
  warnings.warn(
{
    "dst": {
        "location": "file:///tmp/cwl/some/output.txt",
        "basename": "output.txt",
        "class": "File",
        "checksum": "sha1$b63c7c3a7543014bd34d99d31a85606d485837f9",
        "size": 7,
        "path": "/tmp/cwl/some/output.txt"
    }
}INFO Final process status is success
INFO [provenance] Finalizing Research Object
INFO [provenance] Research Object saved to /tmp/cwl/some/prov.out

This leads to the following directory layout:

❯ tree -a prov.out
prov.out
├── bag-info.txt
├── bagit.txt
├── data
│   ├── ad
│   │   └── adbe6c7d3c0d8b19ecd492bec9532c13a6e1c9ad
│   └── b6
│       └── b63c7c3a7543014bd34d99d31a85606d485837f9
├── manifest-sha1.txt
├── metadata
│   ├── logs
│   │   └── engine.a844e9af-9c50-4208-be9f-76db7579c11b.txt
│   ├── manifest.json
│   └── provenance
│       ├── primary.cwlprov.json
│       ├── primary.cwlprov.jsonld
│       ├── primary.cwlprov.nt
│       ├── primary.cwlprov.provn
│       ├── primary.cwlprov.ttl
│       └── primary.cwlprov.xml
├── snapshot
│   └── cp.cwl.yaml
├── tagmanifest-sha1.txt
├── tagmanifest-sha256.txt
├── tagmanifest-sha512.txt
└── workflow
    ├── packed.cwl
    ├── primary-job.json
    └── primary-output.json

9 directories, 20 files
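In this listing the payload files under data/ are named by their SHA-1 digest, and data/b6/b63c7c3a... matches the checksum reported for output.txt above. A minimal sketch for checking such a bag against its manifest-sha1.txt could look like the following. It assumes each manifest line has the form "<sha1> <relative path>"; the helper names `sha1_of` and `verify_bag` are hypothetical, and the demo runs on a throwaway bag rather than on prov.out.

```python
import hashlib
import tempfile
from pathlib import Path


def sha1_of(path: Path) -> str:
    """Return the hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bag(bag_dir: Path) -> dict:
    """Return {relative_path: matches} for every entry in manifest-sha1.txt."""
    results = {}
    for line in (bag_dir / "manifest-sha1.txt").read_text().splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        results[rel_path] = sha1_of(bag_dir / rel_path) == expected
    return results


# Self-contained demo: build a tiny bag with one content-addressed payload file.
with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp)
    content = b"hello\n"
    sha1 = hashlib.sha1(content).hexdigest()
    payload = bag / "data" / sha1[:2] / sha1
    payload.parent.mkdir(parents=True)
    payload.write_bytes(content)
    (bag / "manifest-sha1.txt").write_text(f"{sha1}  data/{sha1[:2]}/{sha1}\n")
    demo_result = verify_bag(bag)
    print(demo_result)
```

Pointing `verify_bag` at `Path("prov.out")` should, under the same assumption about the manifest format, report one entry per payload file.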

Does this have all information from a datalad run-record?

So not everything is readily available in the right format, but missing bits can be added easily.

Using BagIt as the main or only output format seems unnecessarily complex. With a DataLad dataset, we can capture most or all of this information without taking apart the dataset worktree.

mih commented 3 months ago

Closing. Continued in https://github.com/datalad/datalad-remake/issues/14