Open ranijames opened 6 years ago
It's solved by checking if that folder exists already or not, let me get on top of it.
glob: $(inputs.spladder_outDir)/spladder
and I see that spladder_outDir is set to /Alignment/spladder_out/, which is an absolute path.
At https://www.commonwl.org/v1.0/CommandLineTool.html#CommandOutputBinding
glob: Find files relative to the output directory
So an absolute path is not allowed here.
Likewise with InitialWorkDirRequirement.listing: https://www.commonwl.org/v1.0/CommandLineTool.html#Dirent
entryname: The name of the file or subdirectory to create in the output directory.
Again, this can't be an absolute path; it must be relative to the output directory.
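To make the relative-path rule concrete, here is a minimal Python sketch that turns an absolute directory such as /Alignment/spladder_out/ into a pattern relative to the output directory by keeping only its last path component. The helper name relative_glob is hypothetical and is not part of CWL or cwltool; it only illustrates the idea.

```python
import posixpath


def relative_glob(out_dir, subpath):
    """Build a glob pattern that is relative to the output directory.

    Hypothetical helper for illustration only: it keeps just the last
    component of out_dir and appends subpath to it.
    """
    name = posixpath.basename(out_dir.rstrip("/"))
    if not name:
        raise ValueError("out_dir has no final path component")
    pattern = posixpath.join(name, subpath)
    # posixpath.join returns subpath unchanged if subpath is absolute,
    # so reject that case explicitly.
    if posixpath.isabs(pattern):
        raise ValueError("pattern must be relative to the output directory")
    return pattern


print(relative_glob("/Alignment/spladder_out/", "spladder"))
# prints: spladder_out/spladder
```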
@mr-c Thanks for the reply.
If I do not give an absolute path, then CWL starts from scratch to run the script, as if there were no output files. I need CWL to check the already existing output directory (/Alignment/spladder_out/) and then run only for the non-existing output files, which in this case is the merged graph.
It should not start a new run. But now, when I give a relative path, CWL starts a fresh run, re-creating the graphs for each sample and then merging. The graphs are already produced, and I need CWL to run only the merging part, as I specified in the parameters.
Okay. In CWL a step is modeled as a user-defined function that takes the user inputs and makes new data from them (copying the inputs if they need to be modified during execution). If your data is quite big, you can use an extension to v1.0 that allows editing in place; it was developed for astronomy users:
https://github.com/common-workflow-language/cwltool/blob/master/cwltool/extensions.yml#L24
(requires --enable-ext)
While this is being added to v1.1, this is not yet implemented in v1.0 beyond cwltool, toil-cwl-runner, and Arvados.
Thanks @mr-c. Yes, I am using cwltool v1.0. I will try this (--enable-ext) out. Let me rephrase my question to make it clearer.
I need my CWL script to check the existing output directory (produced by a previous run) and run only for the files missing from the output directory. Normally, when I run the same spladder tool on the command line, it prints the following message and continues with the next step:
- All result files already exist.
Otherwise, it takes three times as long.
@mr-c, so I tried the following modifications in the CWL command line tool:
cwlVersion: v1.0
class: CommandLineTool
doc: Spladder
baseCommand: [python2.7, /cluster/home/aalva/spladder/python/spladder.py]
hints:
  cwltool:InplaceUpdateRequirement:
    inplaceUpdate: true
requirements:
  - class: InlineJavascriptRequirement
  - class: InitialWorkDirRequirement
    listing:
      - entry: "$({class: 'Directory', listing: []})"
        entryname: $(inputs.spladder_outDir)
        writable: true
inputs:
  spladder_gtf:
    type: File
    inputBinding:
      position: 3
      prefix: -a
  spladder_bams:
    type: File[]
    inputBinding:
      position: 1
      prefix: -b
    secondaryFiles: .bai
  spladder_outDir:
    type: string
    inputBinding:
      position: 2
      prefix: -o
  spladder_phase2:
    type: string
    inputBinding:
      position: 6
      prefix: -T
  spladder_merge_graphs:
    type: string
    inputBinding:
      position: 5
      prefix: -M
  spladder_primary_alignment:
    type: string
    inputBinding:
      position: 10
      prefix: -P
  spladder_confidence:
    type: int
    inputBinding:
      position: 4
      prefix: -c
  spladder_alt:
    type: string
    inputBinding:
      position: 7
      prefix: -t
  spladder_validate:
    type: string
    inputBinding:
      position: 8
      prefix: -V
  spladder_RL:
    type: int
    inputBinding:
      position: 9
      prefix: -n
outputs:
  spladder_out:
    type: Directory
    outputBinding:
      glob: $(inputs.spladder_outDir)/spladder
$namespaces:
  cwltool: http://commonwl.org/cwltool#
And I ran the above CWL command line tool as:
cwltool --enable-ext /spladder_part1.cwl /part2.yml
But it gives the following warning:
WARNING: Output directory ./spladder_out does not exist - will be created
whereas the spladder_out directory exists in the working directory with partial outputs. I want to somehow tell the tool to look in the output directory and run only for the remaining outputs.
Try
cwlVersion: v1.0
class: CommandLineTool
doc: Spladder
baseCommand: [python2.7, /cluster/home/aalva/spladder/python/spladder.py]
hints:
  cwltool:InplaceUpdateRequirement:
    inplaceUpdate: true
requirements:
  - class: InlineJavascriptRequirement
  - class: InitialWorkDirRequirement
    listing:
      - entry: $(inputs.spladder_dir)
        writable: true
inputs:
  spladder_gtf:
    type: File
    inputBinding:
      position: 3
      prefix: -a
  spladder_bams:
    type: File[]
    inputBinding:
      position: 1
      prefix: -b
    secondaryFiles: .bai
  spladder_dir:
    type: Directory
    inputBinding:
      position: 2
      prefix: -o
  spladder_phase2:
    type: string
    inputBinding:
      position: 6
      prefix: -T
  spladder_merge_graphs:
    type: string
    inputBinding:
      position: 5
      prefix: -M
  spladder_primary_alignment:
    type: string
    inputBinding:
      position: 10
      prefix: -P
  spladder_confidence:
    type: int
    inputBinding:
      position: 4
      prefix: -c
  spladder_alt:
    type: string
    inputBinding:
      position: 7
      prefix: -t
  spladder_validate:
    type: string
    inputBinding:
      position: 8
      prefix: -V
  spladder_RL:
    type: int
    inputBinding:
      position: 9
      prefix: -n
outputs:
  spladder_out:
    type: Directory
    outputBinding:
      glob: $(inputs.spladder_dir.basename)/spladder
$namespaces:
  cwltool: http://commonwl.org/cwltool#
Hello @mr-c, thanks for the input. The spladder_dir: in the yml should now also be a Directory, right? Currently it is passed as a string,
spladder_dir:
  class: Directory
  path: /spaladder_out/spladder/
When running the new script with cwltool --enable-ext /spladder_part_test.cwl /part2.yml, it still starts a fresh run without looking into spladder_dir.
Currently, spladder_dir contains the following files:
ls
genes_graph_conf2.C3N-02289_4_5_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.MM909432_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.T1185B_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.RCC16_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.RCC14_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.T1015A_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.MM909362_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.Me290_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.MM170111_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.Me275_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.C3N-02671_08_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.C3N-02671_10_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.EOC04_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.JM-2_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.GY150806_L1Aligned.sortedByCoord.out.pickle
genes_graph_conf2.C3N-02289_10_L1Aligned.sortedByCoord.out.pickle
I want spladder.cwl to identify them as input and merge them.
Now, with the above solution, it starts by creating the graphs (that is, the *.pickle graphs) for each BAM file in the array, and this is an unnecessary step, as they already exist.
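As a way to confirm which per-sample graphs are already present before launching a run, here is a rough Python sketch. It assumes, based only on the listing above, that graphs are named genes_graph_conf<confidence>.<BAM name without .bam>.pickle; the function name missing_graphs and the /data paths are hypothetical, and this is not how spladder itself performs the check.

```python
import os
import tempfile


def missing_graphs(bam_paths, graph_dir, confidence=2):
    """Return expected per-sample graph file names not found in graph_dir.

    Assumption (inferred from the listing in this thread, not from spladder):
    graphs are named genes_graph_conf<confidence>.<bam stem>.pickle.
    """
    missing = []
    for bam in bam_paths:
        stem = os.path.basename(bam)
        if stem.endswith(".bam"):
            stem = stem[: -len(".bam")]
        name = "genes_graph_conf{}.{}.pickle".format(confidence, stem)
        if not os.path.isfile(os.path.join(graph_dir, name)):
            missing.append(name)
    return missing


# Self-contained demonstration in a temporary directory (paths are made up).
with tempfile.TemporaryDirectory() as tmp:
    done = "genes_graph_conf2.RCC16_L1Aligned.sortedByCoord.out.pickle"
    open(os.path.join(tmp, done), "w").close()
    bams = [
        "/data/RCC16_L1Aligned.sortedByCoord.out.bam",
        "/data/RCC14_L1Aligned.sortedByCoord.out.bam",
    ]
    print(missing_graphs(bams, tmp))
    # prints: ['genes_graph_conf2.RCC14_L1Aligned.sortedByCoord.out.pickle']
```

An empty result would mean every per-sample graph is already there, so only the merge step should have work left to do.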
Expected Behavior
I have a CWL script that runs an in-house Python script, which takes BAM files as input and produces graphs. The script can be executed in parts: the first part generates the graphs, and the second part merges the generated graphs. Currently, if I run the Python script on the command line, it creates the graphs, and on the second run it searches the output directory and merges the graphs. I also notice that CWL writes all output to a temporary scratch directory and only at the end writes the outputs to the defined output directory, so it never searches the working directory. And if I give an absolute path, it throws an error message.
Actual Behavior
The script should run and say
- All result files already exist.
and continue with merging the graphs, as the parameter given is merge_graphs
The CWL workflow code
The spladder command line.
Full Traceback
The YML file
Your Environment
cwltool --version v1.0