DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0
894 stars 241 forks source link

failure running cwl workflow referencing external file #1801

Closed evan-wehi closed 7 years ago

evan-wehi commented 7 years ago

This is the stacktrace:

Traceback (most recent call last):
  File "/home/thomas.e/.local/lib/python2.7/site-packages/toil/worker.py", line 340, in main
    job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
  File "/home/thomas.e/.local/lib/python2.7/site-packages/toil/job.py", line 1289, in _runner
    returnValues = self._run(jobGraph, fileStore)
  File "/home/thomas.e/.local/lib/python2.7/site-packages/toil/job.py", line 1234, in _run
    return self.run(fileStore)
  File "/home/thomas.e/.local/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 301, in run
    existing=existing))
  File "/home/thomas.e/.local/lib/python2.7/site-packages/cwltool/process.py", line 183, in adjustFilesWithSecondary
    adjustFilesWithSecondary(rec[d], op)
  File "/home/thomas.e/.local/lib/python2.7/site-packages/cwltool/process.py", line 183, in adjustFilesWithSecondary
    adjustFilesWithSecondary(rec[d], op)
  File "/home/thomas.e/.local/lib/python2.7/site-packages/cwltool/process.py", line 177, in adjustFilesWithSecondary
    rec["path"] = op(rec["path"], primary=primary)
  File "/home/thomas.e/.local/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 146, in getFile
    fileStoreID, fileName = fileTuple
ValueError: too many values to unpack

This is the failing workflow step:

 trim:
    run: tools/trimmomatic.cwl

    in:
      reads1:
        source: read1
        valueFrom: >
          ${
              self.format = "http://edamontology.org/format_1930";
              return self;
          }
      reads2:
        source: read2
        valueFrom: >
          ${
              self.format = "http://edamontology.org/format_1930";
              return self;
          }
      end_mode:
        default: PE
      nthreads:
        valueFrom: ${ return 2; }
      illuminaClip:
        default:
          adapters:
            class: File
            location: "/stornext/System/data/apps/trimmomatic/trimmomatic-0.36/adapters/TruSeq3-PE.fa"
            path: "/stornext/System/data/apps/trimmomatic/trimmomatic-0.36/adapters/TruSeq3-PE.fa"
          seedMismatches: 1
          palindromeClipThreshold: 20
          simpleClipThreshold: 20
          minAdapterLength: 4
          keepBothReads: "true"

    out: [output_log, reads1_trimmed, reads1_trimmed_unpaired, reads2_trimmed_paired, reads2_trimmed_unpaired]

Using print statements I know it is failing on the adapters file - there is no fileStoreId for this file because it is not in the filestore.

The workflow works with the CWL reference implementation.

mr-c commented 7 years ago

@evan-wehi As a workaround, what happens if adapters becomes a workflow level input with a default value? That would be more portable anyhow regardless.

evan-wehi commented 7 years ago

@mr-c I’ll try that and let you know. I agree this is not portable and should not be part of the workflow. But also it should not be part of the input. I’d like to isolate references and such in something like a site-config.yml but I haven’t had a chance to figure out how to do that in CWL.

mr-c commented 7 years ago

@evan-wehi I understand. This is where we get into the differences between "run a workflow" and "be a comprehensive workflow platform" start to become apparent. Toil is a great workflow runner, but doesn't (yet) try to be a comprehensive workflow platform.

evan-wehi commented 7 years ago

@mr-c

I've just spent two frustrating hours trying to move the adapter location into the input. Seems like it should be trivial but I can't figure it out.

Here is what I have:

in my input:

adaptersx:
  class: File
  path: "/stornext/System/data/apps/trimmomatic/trimmomatic-0.36/adapters/TruSeq3-PE.fa"

In the worflow input section:

inputs:
  read1: File
  read2: File
  adaptersx: File
  mouse-ref: string
  human-ref: string

The step itself:

 trim:
    run: tools/trimmomatic.cwl

    in:
      reads1:
        source: read1
        valueFrom: >
          ${
              self.format = "http://edamontology.org/format_1930";
              return self;
          }
      reads2:
        source: read2
        valueFrom: >
          ${
              self.format = "http://edamontology.org/format_1930";
              return self;
          }
      end_mode:
        default: PE
      nthreads:
        valueFrom: ${ return 2; }
      illuminaClip:
        default:
          adapters: adaptersx
          seedMismatches: 1
          palindromeClipThreshold: 20
          simpleClipThreshold: 20
          minAdapterLength: 4
          keepBothReads: "true"

    out: [output_log, reads1_trimmed, reads1_trimmed_unpaired, reads2_trimmed_paired, reads2_trimmed_unpaired]

trimmomatic.cwl is essentially the same as the one in the user contribution repo. The reference implementation reports:

$ cwltool pdx-pl.cwl pdx-inp.yml 
/home/thomas.e/.local/bin/cwltool 1.0.20170413194156
Resolved 'pdx-pl.cwl' to 'file:///stornext/Home/data/allstaff/t/thomas.e/dev/pdx-genome/test/pdx-pl.cwl'
No handlers could be found for logger "rdflib.term"
Exception on step 'trim'
[step trim] Cannot make job: Invalid job input record:
the `illuminaClip` field is not valid because
  tried illuminaClipping but
pdx-pl.cwl:134:11:     the `adapters` field is not valid because
pdx-pl.cwl:134:11:       is not a dict
[workflow pdx-pl.cwl] outdir is /home/thomas.e/tmp/tmprfKB6K
{}
Final process status is permanentFail
mr-c commented 7 years ago

@evan-wehi Upon further reflection, we don't have a way to create a custom record that references a particular input in CWL v1.0.1.

I've created an issue to track this at https://github.com/common-workflow-language/common-workflow-language/issues/493

Here's how to work around it via an ExpressionTool using this Trimmomatic CWL description

#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: Workflow

requirements:
  SchemaDefRequirement:
    types:
      - $import: trimmomatic-illumina_clipping.yaml 

inputs:
  adapters:
    type: File
    default:
      class: File
      path: /stornext/System/data/apps/trimmomatic/trimmomatic-0.36/adapters/TruSeq3-PE.fa

steps:
  package_Illumina_clipping:
    in: { illumina_adapters: adapters }
    run:
      class: ExpressionTool
      requirements: { InlineJavascriptRequirement: {} }
      inputs: { illumina_adapters: File }
      expression: |
        ${ var illuminaClip = {
              "adapters": inputs.illumina_adapters,
              "seedMismatches": 1,
              "palindromeClipThreshold": 20,
              "simpleClipThreshold": 20,
              "minAdapterLength": 4,
              "keepBothReads": "true" };
           return { "illuminaClip": illuminaClip };
        }
      outputs: { illuminaClip: trimmomatic-illumina_clipping.yaml#illuminaClipping }
    out: [ illuminaClip ]

outputs:
  illuminaClip:
    type: trimmomatic-illumina_clipping.yaml#illuminaClipping
    outputSource: package_Illumina_clipping/illuminaClip

Not a very satisfactory solution, but that should work.

evan-wehi commented 7 years ago

@mr-c Thanks, I’ll try and give it a whirl tomorrow. Unfortunately, I can’t install working cwl reference implementation at the moment. (Issue opened in that repository)

Evan.

evan-wehi commented 7 years ago

@mr-c After several more hours I did get the trimmomatic step working. The workflow executes to completion with the reference implementation and the trimmomatic step executes with toil. I can't help wondering (without understanding the CWL design) why, essentially transporting a string to a command line, needs to be this complicated.

I know have this problem:

Failed to evaluate expression:
Expression evaluation error:

evalmachine.<anonymous>:61
  ret["secondaryFiles"] = [inputs.bamIndex];
                        ^
TypeError: Cannot set property 'secondaryFiles' of undefined
    at evalmachine.<anonymous>:61:25
    at evalmachine.<anonymous>:63:3
    at ContextifyScript.Script.runInContext (vm.js:35:29)
    at ContextifyScript.Script.runInNewContext (vm.js:41:15)
    at Object.runInNewContext (vm.js:93:38)
    at Socket.<anonymous> ([eval]:10:55)
    at emitOne (events.js:96:13)
    at Socket.emit (events.js:188:7)
    at readableAddChunk (_stream_readable.js:176:18)
    at Socket.Readable.push (_stream_readable.js:134:10)

This expression appears in this step:

  #
  # Gather human bam and index into one dependency
  #
  gather:
    run:
      class: ExpressionTool

      inputs:
        bamFile:
          type: File
        bamIndex:
          type: File

      outputs:
        combined:
          type: File

      expression: >
        ${
          var ret = inputs.bamFile;
          ret["secondaryFiles"] = [inputs.bamIndex];
          return {"combined" : ret};
        }

    in:
      bamFile:
        source: sort-human/sorted
      bamIndex:
        source: index-human/index

    out: [combined]

which is a hack I picked up somewhere to make sure that a bam index travels along with the original bam file (the reference implementation requires to execute correctly).

Any insight would be much appreciated.

mr-c commented 7 years ago

Hey @evan-wehi

Thank you for your patience. Running trimmomatic is essentially like configuring another workflow system, but from the command line. As a tool it is an outlier and the source of much frustration across all workflow systems -- the Galaxy wrapper is 500 lines, for example: https://github.com/fls-bioinformatics-core/galaxy-tools/blob/master/tools/trimmomatic/trimmomatic.xml

That's an interesting hack. A simpler method is to output the BAM with the index as a secondaryFile in the step where the index is created: https://github.com/common-workflow-language/workflows/blob/master/tools/samtools-index.cwl#L23

evan-wehi commented 7 years ago

@mr-c

Several more hours and this bit is now work and I have in fact actually managed to get toil run the workflow to completion on some test data. Thank you for your help! I do dispute the claim that trimmomatic is an outlier. I find that everything in CWL is difficult and highly non-obvious.

One last thing (for this thread). The JS to build the illuminaClip object needs to be

var illuminaClip = {
              "adapters": inputs.adapters,
              "seedMismatches": 1,
              "palindromeClipThreshold": 20,
              "simpleClipThreshold": 20,
              "minAdapterLength": 4,
              "keepBothReads": true };
           return { "illuminaClip": illuminaClip };

for toil but

var illuminaClip = {
              "adapters": inputs.illumima_adapters,
              "seedMismatches": 1,
              "palindromeClipThreshold": 20,
              "simpleClipThreshold": 20,
              "minAdapterLength": 4,
              "keepBothReads": true };
           return { "illuminaClip": illuminaClip };

for the reference implementation. It would be nice if they could be the same.

mr-c commented 7 years ago

@evan-wehi I am truly sorry to hear that you've had such a frustrating time with CWL, the CWL reference implementation, and Toil.

The only difference in your JS is adapters": inputs.adapters, vs "adapters": inputs.illumima_adapters,, this suggests that there were other differences in your workflows.

I forget, did we try putting the JS expression in the valueFrom field? That works for me locally, I think this would be the equivalent code for you:

 trim:
    run: tools/trimmomatic.cwl

    in:
      reads1:
        source: read1
        valueFrom: >
          ${
              self.format = "http://edamontology.org/format_1930";
              return self;
          }
      reads2:
        source: read2
        valueFrom: >
          ${
              self.format = "http://edamontology.org/format_1930";
              return self;
          }
      end_mode:
        default: PE
      nthreads:
        valueFrom: ${ return 2; }
      illuminaClip:
        source: adaptersx
        valueFrom: |
          ${ return {
               "adapters": self,
               "seedMismatches": 1,
               "palindromeClipThreshold": 20,
               "simpleClipThreshold": 20,
               "minAdapterLength": 4,
               "keepBothReads": true };
           }

    out: [output_log, reads1_trimmed, reads1_trimmed_unpaired, reads2_trimmed_paired, reads2_trimmed_unpaired]

You'll need to make sure the Workflow level requirements has something like this

requirements:
  SchemaDefRequirement:
    types:
      - $import: trimmomatic-illumina_clipping.yaml 
evan-wehi commented 7 years ago

@mr-c That is a far simpler solution and possibly even consistent with my slowly developing understanding of how CWL works. It also works with both the reference implementation and toil.

Thanks!