common-workflow-language / cwltool

Common Workflow Language reference implementation
https://cwltool.readthedocs.io/
Apache License 2.0

CWL storing outputs into temp file and throwing error #969

Open ranijames opened 6 years ago

ranijames commented 6 years ago

Expected Behavior

I have a CWL script that runs an in-house Python script, which takes BAM files as input and produces graphs. The script can be executed in parts: the first part generates the graphs and the second part merges the generated graphs. Currently, if I run the Python script on the command line, it creates the graphs, and on the second run it searches the output directory and merges the graphs. I also notice that CWL writes all output to a temporary scratch directory and only at the end writes the outputs to the defined output directory, so it never searches the working directory. And if I give an absolute path, it throws an error message.

Actual Behavior

The script should run, report "All result files already exist.", and continue with merging the graphs, since the parameter given is merge_graphs.

The CWL workflow code


#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: Workflow

doc: "Workflow Components: Increment -> Spladder -> PeptidePackage"

requirements:
 - class: ScatterFeatureRequirement

inputs:
 spladder_gtf:
  type: File 
 spladder_bams:
  type: File[]
  secondaryFiles:
    .bai
 spladder_outDir:
  type: Directory
 spladder_confidence:
  type: int
 spladder_merge_graphs:
  type: string
 spladder_phase2:
  type: string
 spladder_alt:
  type: string
 spladder_RL:
  type: int
 spladder_validate:
  type: string

outputs:
 incre_output:
  type: File
  outputSource: increment_count/incre_output
 spladder_out1:
  type: Directory
  outputSource: spladder/spladder_out_dir1
 spladder_out2:
  type: Directory
  outputSource: spladder/spladder_out_dir2

steps:
 spladder:
  run: spladder1.cwl
  in:
   spladder_gtf: spladder_gtf
   spladder_bams: spladder_bams
   spladder_outDir: spladder_outDir
   spladder_confidence: spladder_confidence
   spladder_merge_graphs: spladder_merge_graphs
   spladder_phase2: spladder_phase2
   spladder_alt: spladder_alt
   spladder_RL: spladder_RL
   spladder_validate: spladder_validate
  out: [spladder_out_dir1, spladder_out_dir2, spladderFile]

The spladder command-line tool:

cwlVersion: v1.0
class: CommandLineTool
doc: Spladder

baseCommand: [python2.7, /cluster/home/aalva/spladder/python/spladder.py]

requirements:
 - class: InlineJavascriptRequirement
 - class: InitialWorkDirRequirement
   listing: 
    - entry: "$({class: 'Directory', listing: []})"
      entryname: $(inputs.spladder_outDir)
      writable: true

inputs:
 spladder_gtf: 
  type: File
  inputBinding:
   position: 3
   prefix: -a
 spladder_bam: 
  type: File
  inputBinding:
   position: 1
   prefix: -b
  secondaryFiles: .bai
 spladder_outDir:
  type: Directory
  inputBinding:
   position: 2
   prefix: -o
 spladder_phase2:
  type: string
  inputBinding:
   position: 6
   prefix: -T
 spladder_merge_graphs:
  type: string
  inputBinding:
    position: 5
    prefix: -M
 spladder_primary_alignment:
  type: string
  inputBinding:
    position: 10
    prefix: -P
 spladder_confidence:
  type: int
  inputBinding:
    position: 4
    prefix: -c
 spladder_alt:
  type: string
  inputBinding:
    position: 7
    prefix: -t
 spladder_validate:
  type: string
  inputBinding:
    position: 8
    prefix: -V
 spladder_RL:
  type: int
  inputBinding:
    position: 9
    prefix: -n

outputs:
 spladder_out:
  type: Directory
  outputBinding:
   glob: $(inputs.spladder_outDir)/spladder

Full Traceback

Traceback (most recent call last):
  File "/cluster/home/aalva/software/anaconda/envs/py2/lib/python2.7/site-packages/cwltool/executors.py", line 167, in run_jobs
    job.run(runtime_context)
  File "/cluster/home/aalva/software/anaconda/envs/py2/lib/python2.7/site-packages/cwltool/job.py", line 418, in run
    symLink=True, secret_store=runtimeContext.secret_store)
  File "/cluster/home/aalva/software/anaconda/envs/py2/lib/python2.7/site-packages/cwltool/process.py", line 257, in stageFiles
    os.makedirs(p.target, 0o0755)
  File "/cluster/home/aalva/software/anaconda/envs/py2/lib/python2.7/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 17] File exists: '/Alignment/spladder_out/'
Workflow error, try again with --debug for more information:
[Errno 17] File exists: '/Alignment/spladder_out/'

The YML file

spladder_gtf: 
 class: File
 path: /usage_examples/gencode.v19.annotation.hs37d5_chr.spladder.gtf
spladder_outDir: /Alignment/spladder_out/
spladder_out_dir1: /spladder_out1
spladder_out_dir2: /spladder_out2
spladder_bams: [
 {class: File, path: /Alignment/C3N-02289_10_L1Aligned.sortedByCoord.out.bam},
 {class: File, path: /Alignment/C3N-02289_4_5_L1Aligned.sortedByCoord.out.bam},
 {class: File, path: /cluster/work/grlab/projects/alva_temp/Alignment/C3N-02671_08_L1Aligned.sortedByCoord.out.bam},
 {class: File, path: /Alignment/C3N-02671_10_L1Aligned.sortedByCoord.out.bam},
 {class: File, path: /Alignment/EOC04_L1Aligned.sortedByCoord.out.bam},
 {class: File, path: /Alignment/GY150806_L1Aligned.sortedByCoord.out.bam},
 {class: File, path: /JM-2_L1Aligned.sortedByCoord.out.bam},
 {class: File, path: /Alignment/Me275_L1Aligned.sortedByCoord.out.bam},
 {class: File, path: /Alignment/Me290_L1Aligned.sortedByCoord.out.bam},
 {class: File, path: /Alignment/MM170111_L1Aligned.sortedByCoord.out.bam},
 {class: File, path: /Alignment/MM909362_L1Aligned.sortedByCoord.out.bam},
 {class: File, path: /Alignment/MM909432_L1Aligned.sortedByCoord.out.bam},
 {class: File, path: /Alignment/RCC14_L1Aligned.sortedByCoord.out.bam},
 {class: File, path: /Alignment/RCC16_L1Aligned.sortedByCoord.out.bam},
 {class: File, path: /Alignment/T1015A_L1Aligned.sortedByCoord.out.bam},
 {class: File, path: /Alignment/T1185B_L1Aligned.sortedByCoord.out.bam}
]
spladder_confidence: 2
spladder_merge_graphs: merge_graphs
spladder_alt: alt_3prime
spladder_RL: 100
spladder_phase2: y
spladder_primary_alignment: y

psafont commented 6 years ago

This can be solved by checking whether that folder already exists; let me look into it.
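
A minimal sketch of that kind of existence check, assuming plain Python and a hypothetical helper named ensure_dir. It is not the actual cwltool patch; it only shows how the Errno 17 from os.makedirs in the traceback can be tolerated when the target is already a directory.

```python
import errno
import os
import tempfile


def ensure_dir(path, mode=0o755):
    """Create `path` (and parents) unless it is already a directory.

    Hypothetical helper for illustration; not the actual cwltool fix.
    """
    try:
        os.makedirs(path, mode)
    except OSError as exc:
        # Errno 17 (EEXIST) is what the traceback reports.
        # Tolerate it only when the existing entry really is a directory.
        if exc.errno != errno.EEXIST or not os.path.isdir(path):
            raise
    return path


# Demonstration in a throwaway location.
base = tempfile.mkdtemp()
target = os.path.join(base, "spladder_out")
ensure_dir(target)  # first call creates the directory
ensure_dir(target)  # second call returns quietly instead of raising
```

Note that this only removes the crash; it does not make an absolute path a valid staging target.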

mr-c commented 6 years ago

Your tool has

glob: $(inputs.spladder_outDir)/spladder

and I see that spladder_outDir is set to /Alignment/spladder_out/, which is an absolute path.

At https://www.commonwl.org/v1.0/CommandLineTool.html#CommandOutputBinding

glob: Find files relative to the output directory

So an absolute path is not allowed here.

mr-c commented 6 years ago

Likewise with InitialWorkDirRequirement.listing: https://www.commonwl.org/v1.0/CommandLineTool.html#Dirent

entryname: The name of the file or subdirectory to create in the output directory.

Again, this can't be an absolute path; it must be relative to the output directory.
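
To make the distinction concrete, here is a small Python sketch, assuming POSIX-style paths and a hypothetical helper named relative_entryname. It is not part of cwltool; it only illustrates that the value from the YML file is absolute and that its last component (what CWL exposes as the basename field of a Directory) is the part usable as a relative name.

```python
import posixpath


def relative_entryname(path):
    """Return the final path component, usable as a relative name.

    Hypothetical helper for illustration; not part of cwltool.
    """
    # normpath drops the trailing slash so that basename is not empty.
    return posixpath.basename(posixpath.normpath(path))


user_path = "/Alignment/spladder_out/"
is_absolute = posixpath.isabs(user_path)       # True: not usable as-is
name = relative_entryname(user_path)           # "spladder_out"
pattern = posixpath.join(name, "spladder")     # "spladder_out/spladder"
print(is_absolute, name, pattern)
```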

ranijames commented 6 years ago

@mr-c Thanks for the reply.

If I do not give an absolute path, CWL runs the script from scratch, as if there were no output files. I need CWL to check the already existing output directory (/Alignment/spladder_out/) and then run only for the missing output files, which in this case is the merged graph.

It should not start as a new run. But now, when I give a relative path, CWL starts a fresh run, recreating the graphs for each sample and then merging them, whereas the graphs have already been produced and I only need CWL to run the merging part, as specified in the parameters.

mr-c commented 6 years ago

Okay. In CWL, a step is modeled as a user-defined function that takes the user's inputs and makes new data from them (copying the inputs if they need to be modified during execution). If your data is quite big, you can use an extension to v1.0 that allows in-place editing; it was developed for astronomy users:

https://github.com/common-workflow-language/cwltool/blob/master/cwltool/extensions.yml#L24 (requires --enable-ext)

While this is being added to v1.1, for v1.0 it is not yet implemented beyond cwltool, toil-cwl-runner, and Arvados.
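
The "step as a function" model can be sketched in plain Python. This is a toy illustration under stated assumptions, not how cwltool is implemented: run_step and fake_merge are hypothetical names, and the point is only that a tool sees what was staged into its scratch directory, so previous results have to be passed in as an input to be visible.

```python
import os
import shutil
import tempfile


def run_step(input_dir, tool):
    """Run `tool` on a private copy of `input_dir` and return the copy.

    Hypothetical model of a step: the tool never touches the original.
    """
    scratch = tempfile.mkdtemp()
    name = os.path.basename(os.path.normpath(input_dir))
    staged = os.path.join(scratch, name)
    shutil.copytree(input_dir, staged)  # inputs are copied, not edited in place
    tool(staged)                        # the tool writes into the copy
    return staged


def fake_merge(workdir):
    """Stand-in for the merge phase: records which files it could see."""
    seen = sorted(os.listdir(workdir))
    with open(os.path.join(workdir, "merged.txt"), "w") as handle:
        handle.write("\n".join(seen))


# Previous results live in `original`; passing it as an input stages them.
original = tempfile.mkdtemp()
open(os.path.join(original, "graph_a.pickle"), "w").close()
result = run_step(original, fake_merge)
```

In this model the merged file ends up in the staged copy, and the original directory is left unchanged; the in-place extension exists to avoid that copy for large data.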

ranijames commented 6 years ago

Thanks @mr-c. Yes, I am using cwltool v1.0. I will try this (--enable-ext) out. Let me rephrase my question to make it clearer.

I need my CWL script to check the existing output directory (produced by a previous run) and run only for the files missing from it. Normally, when I run the same spladder tool on the command line, it prints the message "All result files already exist." and continues with the next step.

Otherwise, it takes three times as long.

ranijames commented 6 years ago

@mr-c, so I tried the following modifications in the CWL command-line tool:

cwlVersion: v1.0
class: CommandLineTool
doc: Spladder

baseCommand: [python2.7, /cluster/home/aalva/spladder/python/spladder.py]

hints:
  cwltool:InplaceUpdateRequirement:
    inplaceUpdate: true
requirements:
 - class: InlineJavascriptRequirement
 - class: InitialWorkDirRequirement
   listing: 
    - entry: "$({class: 'Directory', listing: []})"
      entryname: $(inputs.spladder_outDir)
      writable: true

inputs:
 spladder_gtf: 
  type: File
  inputBinding:
   position: 3
   prefix: -a
 spladder_bams: 
  type: File[]
  inputBinding:
   position: 1
   prefix: -b
  secondaryFiles: .bai
 spladder_outDir:
  type: string
  inputBinding:
   position: 2
   prefix: -o
 spladder_phase2:
  type: string
  inputBinding:
   position: 6
   prefix: -T
 spladder_merge_graphs:
  type: string
  inputBinding:
    position: 5
    prefix: -M
 spladder_primary_alignment:
  type: string
  inputBinding:
    position: 10
    prefix: -P
 spladder_confidence:
  type: int
  inputBinding:
    position: 4
    prefix: -c
 spladder_alt:
  type: string
  inputBinding:
    position: 7
    prefix: -t
 spladder_validate:
  type: string
  inputBinding:
    position: 8
    prefix: -V
 spladder_RL:
  type: int
  inputBinding:
    position: 9
    prefix: -n

outputs:
 spladder_out:
  type: Directory
  outputBinding:
   glob: $(inputs.spladder_outDir)/spladder

$namespaces:
  cwltool: http://commonwl.org/cwltool#

And I ran the above CWL command-line tool as:

 cwltool --enable-ext /spladder_part1.cwl /part2.yml

But it gives the following warning: WARNING: Output directory ./spladder_out does not exist - will be created

However, the spladder_out directory does exist in the working directory and already holds part of the outputs. I want to tell the tool to look in the output directory and run only for the remaining outputs.

mr-c commented 6 years ago

Try

cwlVersion: v1.0
class: CommandLineTool
doc: Spladder

baseCommand: [python2.7, /cluster/home/aalva/spladder/python/spladder.py]

hints:
  cwltool:InplaceUpdateRequirement:
    inplaceUpdate: true
requirements:
 - class: InlineJavascriptRequirement
 - class: InitialWorkDirRequirement
   listing: 
    - entry: $(inputs.spladder_dir)
      writable: true

inputs:
 spladder_gtf: 
  type: File
  inputBinding:
   position: 3
   prefix: -a
 spladder_bams: 
  type: File[]
  inputBinding:
   position: 1
   prefix: -b
  secondaryFiles: .bai
 spladder_dir:
  type: Directory
  inputBinding:
   position: 2
   prefix: -o
 spladder_phase2:
  type: string
  inputBinding:
   position: 6
   prefix: -T
 spladder_merge_graphs:
  type: string
  inputBinding:
    position: 5
    prefix: -M
 spladder_primary_alignment:
  type: string
  inputBinding:
    position: 10
    prefix: -P
 spladder_confidence:
  type: int
  inputBinding:
    position: 4
    prefix: -c
 spladder_alt:
  type: string
  inputBinding:
    position: 7
    prefix: -t
 spladder_validate:
  type: string
  inputBinding:
    position: 8
    prefix: -V
 spladder_RL:
  type: int
  inputBinding:
    position: 9
    prefix: -n

outputs:
 spladder_out:
  type: Directory
  outputBinding:
   glob: $(inputs.spladder_dir.basename)/spladder

$namespaces:
  cwltool: http://commonwl.org/cwltool#

ranijames commented 6 years ago

Hello @mr-c, thanks for the input. So spladder_dir in the YML file should now also be a directory, right? Previously it was passed as a string; it is now:

spladder_dir: 
 class: Directory
 path: /spaladder_out/spladder/

When I run the new script with cwltool --enable-ext /spladder_part_test.cwl /part2.yml, it still starts a fresh run without looking into spladder_dir. Currently, spladder_dir contains the following files (output of ls):

genes_graph_conf2.C3N-02289_4_5_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.MM909432_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.T1185B_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.RCC16_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.RCC14_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.T1015A_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.MM909362_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.Me290_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.MM170111_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.Me275_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.C3N-02671_08_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.C3N-02671_10_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.EOC04_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.JM-2_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.GY150806_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.C3N-02289_10_L1Aligned.sortedByCoord.out.pickle

I want spladder.cwl to identify them as input and merge them.

With the above solution, it starts by creating the graphs (that is, the *.pickle graphs) for each BAM file in the array, which is an unnecessary step, as they already exist.
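
For reference, the check I expect can be sketched in Python. This is only an illustration: missing_graphs is a hypothetical function, and the file-name pattern genes_graph_conf<confidence>.<sample>.pickle is inferred from the listing above, not taken from spladder's documentation.

```python
import os
import tempfile


def missing_graphs(bam_paths, graph_dir, confidence=2):
    """Return the BAM paths whose graph pickle is absent from `graph_dir`.

    The naming pattern is an assumption inferred from the directory listing.
    """
    missing = []
    for bam in bam_paths:
        sample = os.path.basename(bam)
        if sample.endswith(".bam"):
            sample = sample[: -len(".bam")]
        expected = "genes_graph_conf{}.{}.pickle".format(confidence, sample)
        if not os.path.isfile(os.path.join(graph_dir, expected)):
            missing.append(bam)
    return missing


# Demonstration with one existing graph and one missing graph.
demo_dir = tempfile.mkdtemp()
existing = "genes_graph_conf2.EOC04_L1Aligned.sortedByCoord.out.pickle"
open(os.path.join(demo_dir, existing), "w").close()
bams = [
    "/Alignment/EOC04_L1Aligned.sortedByCoord.out.bam",
    "/Alignment/RCC14_L1Aligned.sortedByCoord.out.bam",
]
todo = missing_graphs(bams, demo_dir)
print(todo)
```

When the returned list is empty, only the merge step would need to run.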