Epigenomics-Screw / Screw

SCREW: A Reproducible Workflow for Single-Cell Epigenomics
MIT License

Connect preProcess.cwl to pairwise-distance.cwl #26

Open oneillkza opened 6 years ago

oneillkza commented 6 years ago

For a real data set, we need to run preProcess across every library, then gather the results together to run pairwise-distance and heatmap (and various other tasks, e.g. pooling methylation and then generating BigWigs from the pooled data).

@mr-c suggested taking a look at https://github.com/ProteinsWebTeam/ebi-metagenomics-cwl for examples of the scatter and gather functionality in CWL.
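
As a rough sketch of the scatter-then-gather shape described above (not the repository's actual wiring): the step below runs preProcess.cwl once per library, and the scattered output arrives at pairwise-distance.cwl as an array. All input and output ids here (`libraries`, `library`, `processed`, `processed_files`, `distances`) are hypothetical placeholders, since the real ids of the two tools are not given in this thread.

```yaml
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: Workflow

requirements:
  ScatterFeatureRequirement: {}
  # Only needed if preProcess.cwl is itself a Workflow rather than a tool.
  SubworkflowFeatureRequirement: {}

inputs:
  libraries: File[]                # hypothetical: one input file per library

outputs:
  distances:
    type: File                     # hypothetical output of pairwise-distance.cwl
    outputSource: pairwise_distance/distances

steps:
  preprocess:
    run: preProcess.cwl
    scatter: library               # run the step once per element of `libraries`
    in:
      library: libraries           # hypothetical input id of preProcess.cwl
    out: [ processed ]             # hypothetical output id (a single File per run)

  pairwise_distance:
    run: pairwise-distance.cwl
    in:
      # Because `preprocess` is scattered, `processed` is gathered into File[] here.
      processed_files: preprocess/processed   # hypothetical input id
    out: [ distances ]             # hypothetical output id
```

The gather needs no extra step: a scattered step's outputs are arrays, so any downstream step that accepts `File[]` can consume them directly.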

mr-c commented 6 years ago

Here's a specific example: https://github.com/ProteinsWebTeam/ebi-metagenomics-cwl/blob/d656541cc09c87715c67f91f4382f88dc61d8778/workflows/orf_prediction.cwl#L45

oneillkza commented 6 years ago

Thanks, @mr-c. I'm still trying to figure out how to scatter across the contents of a directory, though (i.e. I have a directory full of input files, and I want to apply the preprocessing workflow to every one of them).

I haven't yet been able to find an example of that anywhere. I have seen some hints that having an input of type Directory and using inputs.inputdir.listing should work, but I'm not having any luck yet.

mr-c commented 6 years ago

Ah, this is because scatter runs before valueFrom -- see the discussion at https://github.com/common-workflow-language/common-workflow-language/issues/419

As a workaround, add a step to turn the directory into an array of Files (discarding any subdirectories) or change your inputs to be an array of Files outright.

Here's an example step with an inline ExpressionTool to convert a Directory to an array of Files:

  directory_to_array:
    in: { directory: some_step/some_directory }
    run:
      class: ExpressionTool
      requirements: { InlineJavascriptRequirement: {} }
      inputs: { directory: Directory }
      expression: |
        ${ var i, len = inputs.directory.listing.length;
           for (i = len - 1; i >= 0; i--) {
             if (inputs.directory.listing[i].class != 'File') {
                inputs.directory.listing.splice(i, 1);
             }
           }
           return { "array_of_files": inputs.directory.listing };
        }
      outputs:
        array_of_files: File[]
    out: [ array_of_files ]

or as a slightly more verbose external tool for reuse:

#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: ExpressionTool
requirements:
  InlineJavascriptRequirement: {}
label: Convert a Directory to an array of Files, skipping subfolders

inputs:
  directory:
    type: Directory

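expression: |
  ${
      // Walk the listing backwards so that splice() does not shift
      // the indices of entries that have not been checked yet.
      var i, len = inputs.directory.listing.length;
      for (i = len - 1; i >= 0; i--) {
         // Drop anything that is not a File (e.g. subdirectories).
         if (inputs.directory.listing[i].class != 'File') {
            inputs.directory.listing.splice(i, 1);
         }
      }
      return { "array_of_files": inputs.directory.listing };
  }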

outputs:
  array_of_files:
    type: File[]

Here's the same tool in a self-contained workflow, using scatter:

#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: Workflow

requirements: { ScatterFeatureRequirement: {} }

inputs: { dir: Directory }
outputs:
  names:
    type: string[]
    outputSource: list_array/basename

steps:
  directory_to_array:
    in: { directory: dir}
    run:
      class: ExpressionTool
      requirements: { InlineJavascriptRequirement: {} }
      inputs: { directory: Directory }
      expression: |
        ${ var i, len = inputs.directory.listing.length;
           for (i = len - 1; i >= 0; i--) {
             if (inputs.directory.listing[i].class != 'File') {
                inputs.directory.listing.splice(i, 1);
             }
           }
           return { "array_of_files": inputs.directory.listing };
        }
      outputs:
        array_of_files: File[]
    out: [ array_of_files ]
  list_array:
    in: { file: directory_to_array/array_of_files }
    run:
      class: ExpressionTool
      requirements: { InlineJavascriptRequirement: {} }
      inputs: { file: File }
      expression: |
        ${return { "basename": inputs.file.basename };}
      outputs: { basename: string }
    out: [ basename ]
    scatter: file
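
The other option mentioned above, changing the workflow input to an array of Files outright, might look roughly like the sketch below. It reuses the `list_array` step from the workflow above and simply drops the Directory conversion; the input id `files` is a placeholder chosen for illustration.

```yaml
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: Workflow

requirements:
  ScatterFeatureRequirement: {}

inputs:
  files: File[]                    # placeholder id: the files to scatter over

outputs:
  names:
    type: string[]
    outputSource: list_array/basename

steps:
  list_array:
    in: { file: files }
    scatter: file                  # one run per element of `files`
    run:
      class: ExpressionTool
      requirements: { InlineJavascriptRequirement: {} }
      inputs: { file: File }
      expression: |
        ${ return { "basename": inputs.file.basename }; }
      outputs: { basename: string }
    out: [ basename ]
```

The job file then lists each input explicitly (the paths here are made up):

```yaml
files:
  - { class: File, path: libraries/sample1.txt }
  - { class: File, path: libraries/sample2.txt }
```

The trade-off is that the file list has to be generated outside CWL, but the workflow no longer depends on the Directory listing being available to the expression.
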

oneillkza commented 6 years ago

Estimating this as a 3