common-workflow-language / cwltool

Common Workflow Language reference implementation
https://cwltool.readthedocs.io/
Apache License 2.0

cached cwl.output.json contains docker file path references #1573

Open tschoonj opened 2 years ago

tschoonj commented 2 years ago

Hi all,

I am experimenting with cwl.output.json to get the results back from a CommandLineTool that executes in a Docker environment. This works fine on the first run, but there appears to be a problem when re-running the same workflow: cwltool correctly recognizes that the cache can be used, but it chokes on the file path saved in cwl.output.json, which points into the randomly generated folder that was used during the first run.

Expected Behavior

Re-running the workflow should reuse the cached output without errors.

Actual Behavior

When retrying, I get the following error:

cwltool --outdir output --cachedir cache spike.cwl spike.yaml
INFO /usr/local/miniforge3/bin/cwltool 3.1.20211107152837
INFO Resolved 'spike.cwl' to 'file:///home/tom/gitlab/cwl-workflows/workflows/spike.cwl'
spike.cwl:8:3: Warning: checking item
                      Warning:   Field `class` contains undefined reference to
                      `http://commonwl.org/cwltool#Secrets`
INFO spike.cwl:8:3: Unknown hint http://commonwl.org/cwltool#Secrets
INFO [workflow ] start
INFO [workflow ] starting step arv_get
INFO [step arv_get] start
INFO [job arv_get] Using cached output in /home/tom/gitlab/cwl-workflows/workflows/cache/d04184a5b32119f4058d7e8fbc6ff511
ERROR Workflow error, try again with --debug for more information:
Output file path /cByOZc/ubuntu.sif must be within designated output directory (/nnKPqR) or an input file pass through.
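
To illustrate why the cached path is rejected, here is a minimal Python sketch of a "path must be inside the output directory" check. It is not cwltool's actual code; the helper name is_within is hypothetical, and the directory roles are inferred from the error message and the docker command in the first-run log.

```python
from pathlib import PurePosixPath


def is_within(path: str, directory: str) -> bool:
    """Return True if `path` is `directory` itself or lies beneath it."""
    p = PurePosixPath(path)
    d = PurePosixPath(directory)
    return p == d or d in p.parents


# Values copied from the error message and the first-run docker command.
cached_path = "/cByOZc/ubuntu.sif"  # path stored in the cached cwl.output.json
first_run_dir = "/cByOZc"           # container working directory on the first run
second_run_dir = "/nnKPqR"          # output directory named in the error on the re-run

print(is_within(cached_path, first_run_dir))   # True
print(is_within(cached_path, second_run_dir))  # False
```

Because the container directory name differs between runs, an absolute container path stored in the cache can never pass a check of this kind on a later run.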

The initial run produced:

INFO /usr/local/miniforge3/bin/cwltool 3.1.20211107152837
INFO Resolved 'spike.cwl' to 'file:///home/tom/gitlab/cwl-workflows/workflows/spike.cwl'
spike.cwl:8:3: Warning: checking item
                      Warning:   Field `class` contains undefined reference to
                      `http://commonwl.org/cwltool#Secrets`
INFO spike.cwl:8:3: Unknown hint http://commonwl.org/cwltool#Secrets
INFO [workflow ] start
INFO [workflow ] starting step arv_get
INFO [step arv_get] start
INFO [job arv_get] Output of job will be cached in /home/tom/gitlab/cwl-workflows/workflows/cache/d04184a5b32119f4058d7e8fbc6ff511
INFO [job arv_get] /home/tom/gitlab/cwl-workflows/workflows/cache/d04184a5b32119f4058d7e8fbc6ff511$ docker \
    run \
    -i \
    --mount=type=bind,source=/home/tom/gitlab/cwl-workflows/workflows/cache/d04184a5b32119f4058d7e8fbc6ff511,target=/cByOZc \
    --mount=type=bind,source=/tmp/0n3iy0j2,target=/tmp \
    --workdir=/cByOZc \
    --read-only=true \
    --user=1002:1002 \
    --rm \
    --cidfile=/tmp/z5bv8n5s/20211207145621-197335.cid \
    --env=TMPDIR=/tmp \
    --env=HOME=/cByOZc \
    arv-cli:build-tar-fd145ede211e86f23f7aeab39e45de43 \
    arv-get-cwl
INFO [job arv_get] Max memory used: 47MiB
INFO [job arv_get] completed success
INFO [step arv_get] completed success
INFO [workflow ] completed success
{
    "collection_file": [
        {
            "class": "File",
            "basename": "ubuntu.sif",
            "location": "file:///home/tom/gitlab/cwl-workflows/workflows/output/ubuntu.sif",
            "checksum": "sha1$8a13313f5de5ace0d943ff7a3257fc83c0538829",
            "size": 27742208,
            "path": "/home/tom/gitlab/cwl-workflows/workflows/output/ubuntu.sif"
        }
    ]
}
INFO Final process status is success

Workflow Code

CommandLineTool arv-get.cwl:

cwlVersion: v1.2
class: CommandLineTool

requirements:
  DockerRequirement:
    dockerPull: arv-cli
  NetworkAccess:
    networkAccess: true
  InitialWorkDirRequirement:
    listing:
      - entryname: cwl.inputs.json
        entry: '{"inputs": $(inputs)}'

baseCommand:
  - arv-get-cwl

inputs:
  arvados_collection_locator: string
  arvados_api_token: string
  arvados_api_host: string

outputs:
  collection_file: File

The arv-get-cwl script within the container extracts the inputs from cwl.inputs.json and passes them to the arv-get command. It then writes a cwl.output.json file that records the path of the downloaded file:

cat > ${outputfile} <<EOL
{
  "collection_file": {
    "path": "${download_destination}",
    "class": "File"
  }
}
EOL

Workflow spike.cwl:

cwlVersion: v1.2
class: Workflow

$namespaces:
  cwltool: "http://commonwl.org/cwltool#"

hints:
  "cwltool:Secrets":
    secrets: [arvados_api_token]

requirements:
  InlineJavascriptRequirement: {}
  ScatterFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  MultipleInputFeatureRequirement: {}

inputs:
  arvados_input_collection_locators: string[]
  arvados_output_collection_name: string
  arvados_api_host: string
  arvados_api_token: string

outputs:
  collection_file:
    type: File[]
    outputSource: arv_get/collection_file

steps:
  arv_get:
    run: arv-get.cwl
    scatter: arvados_collection_locator
    in:
      arvados_api_token: arvados_api_token
      arvados_api_host: arvados_api_host
      arvados_collection_locator: arvados_input_collection_locators
    out:
      - collection_file

Your Environment

CC @jrandall

tetron commented 2 years ago

As a workaround, it might work if you use relative paths in the cwl.output.json.
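
A minimal Python sketch of that workaround, assuming the script knows both the download destination and the container working directory. The function name relative_output_record is hypothetical, and whether a relative path actually avoids the caching error is untested here.

```python
import json
import posixpath


def relative_output_record(download_destination: str, workdir: str) -> dict:
    """Build the cwl.output.json content with a path relative to `workdir`."""
    rel = posixpath.relpath(download_destination, start=workdir)
    return {"collection_file": {"class": "File", "path": rel}}


# Example values taken from the logs in this issue.
record = relative_output_record("/cByOZc/ubuntu.sif", "/cByOZc")
print(json.dumps(record, indent=2))
```

The record no longer mentions the randomly named container directory, so the cached file would not depend on the directory name chosen during the first run.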

For the general case, cwltool would probably need to apply reverse path mapping to cwl.output.json to get the paths outside the container.
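
A rough Python sketch of what such a reverse path mapping could look like, assuming the list of bind mounts (host source, container target) is known. This is an illustration only, not cwltool's implementation; reverse_map and the mount list format are hypothetical.

```python
from pathlib import PurePosixPath


def reverse_map(container_path: str, mounts: list[tuple[str, str]]) -> str:
    """Translate a container path to a host path using (source, target) mounts.

    Paths outside every mount target are returned unchanged.
    """
    p = PurePosixPath(container_path)
    # Try the longest targets first so nested mounts take precedence.
    for source, target in sorted(mounts, key=lambda m: len(m[1]), reverse=True):
        t = PurePosixPath(target)
        if p == t or t in p.parents:
            return str(PurePosixPath(source) / p.relative_to(t))
    return container_path


# Bind mounts copied from the docker command in the first-run log.
mounts = [
    ("/home/tom/gitlab/cwl-workflows/workflows/cache/d04184a5b32119f4058d7e8fbc6ff511", "/cByOZc"),
    ("/tmp/0n3iy0j2", "/tmp"),
]

print(reverse_map("/cByOZc/ubuntu.sif", mounts))
```

Applied to every File path in cwl.output.json before it is stored in the cache, a mapping like this would leave only host paths in the cached result.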