bespin-workflows / exomeseq-gatk4

Whole Exome Sequencing in CWL using GATK4
MIT License

Update fastqc, trim_galore to improve cache-ability in preprocessing #9

Closed dleehr closed 5 years ago

dleehr commented 5 years ago

When cwltool runs with the --cachedir option, its command_line_tool module builds a cache key from each tool's input files, command line, and requirements (e.g. the Docker image). Through experimentation, I found that the command lines for fastqc and trim_galore changed on every run of the workflow.

See https://github.com/common-workflow-language/cwltool/blob/4a31f2a1c1163492ae37bbc748a299e8318c462c/cwltool/command_line_tool.py#L328-L355

Both tool definitions used $(runtime.outdir) to build their command lines. The runtime outdir is a random directory that changes on every run, so the cache key never matched and these steps were never cacheable. Since all of the downstream processing depends on the trimmed reads, the rest of the workflow was never cacheable either.

This PR changes those definitions to use CWL features (InitialWorkDirRequirement for fastqc) or to lean on the tool's default behavior (for trim_galore), so that every step of the preprocessing workflow can be cached.
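
To illustrate why the outdir mattered, here is a minimal sketch of the idea, not cwltool's actual code. It assumes a simplified cache key that hashes only the command line, an input checksum, and a Docker image name. The `cache_key` and `run_once` helpers, the checksum value, and the `example/fastqc:latest` image name are all hypothetical.

```python
import hashlib
import json
import tempfile


def cache_key(command_line, input_files, docker_image):
    """Hash the parts of a job that a cache would compare between runs.

    Simplified stand-in for the idea described above, not cwltool's code.
    """
    key_parts = {
        "cmdline": command_line,
        "inputs": input_files,   # hypothetical: file name -> content checksum
        "docker": docker_image,  # hypothetical image name
    }
    serialized = json.dumps(key_parts, sort_keys=True)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


def run_once(use_runtime_outdir):
    """Build the key for one simulated run of a fastqc step."""
    # Each simulated run gets a fresh, randomly named output directory,
    # standing in for $(runtime.outdir).
    with tempfile.TemporaryDirectory() as outdir:
        if use_runtime_outdir:
            cmd = ["fastqc", "--outdir", outdir, "reads.fastq.gz"]
        else:
            cmd = ["fastqc", "reads.fastq.gz"]
        return cache_key(cmd, {"reads.fastq.gz": "abc123"}, "example/fastqc:latest")


# Collect the distinct keys seen across three simulated runs of each variant.
keys_with_outdir = {run_once(True) for _ in range(3)}
keys_without_outdir = {run_once(False) for _ in range(3)}
print(len(keys_with_outdir), len(keys_without_outdir))
```

With the output directory embedded in the command line, each simulated run produces a different key. Without it, the key is identical across runs, which is the property the updated tool definitions rely on.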

dleehr commented 5 years ago

Tested with calrissian 0.5.0 and the current cwltool master branch.