After gatk4 finishes, the execution dir looks like this:
drwx------ 3 root root 4.0K Jan 14 19:16 genomicsdb-0
-rw-r--r-- 3 root root 5.7K Jan 14 20:17 genomicsdb-0.vcf.gz
-rw-r--r-- 2 root root 105 Jan 14 20:17 genomicsdb-0.vcf.gz.tbi
drwx------ 3 root root 4.0K Jan 14 19:17 genomicsdb-1
-rw-r--r-- 3 root root 927K Jan 14 20:32 genomicsdb-1.vcf.gz
-rw-r--r-- 2 root root 7.6K Jan 14 20:32 genomicsdb-1.vcf.gz.tbi
drwx------ 3 root root 4.0K Jan 14 19:29 genomicsdb-2
-rw-r--r-- 3 root root 554K Jan 14 20:31 genomicsdb-2.vcf.gz
-rw-r--r-- 2 root root 11K Jan 14 20:31 genomicsdb-2.vcf.gz.tbi
drwx------ 3 root root 4.0K Jan 14 19:41 genomicsdb-3
-rw-r--r-- 3 root root 813K Jan 14 20:30 genomicsdb-3.vcf.gz
-rw-r--r-- 2 root root 11K Jan 14 20:30 genomicsdb-3.vcf.gz.tbi
drwx------ 3 root root 4.0K Jan 14 19:52 genomicsdb-4
-rw-r--r-- 3 root root 620K Jan 14 20:32 genomicsdb-4.vcf.gz
-rw-r--r-- 2 root root 12K Jan 14 20:32 genomicsdb-4.vcf.gz.tbi
drwx------ 3 root root 4.0K Jan 14 20:04 genomicsdb-5
-rw-r--r-- 3 root root 50K Jan 14 20:17 genomicsdb-5.vcf.gz
-rw-r--r-- 2 root root 746 Jan 14 20:17 genomicsdb-5.vcf.gz.tbi
drwx------ 3 root root 4.0K Jan 14 20:05 genomicsdb-6
-rw-r--r-- 3 root root 673K Jan 14 20:31 genomicsdb-6.vcf.gz
-rw-r--r-- 2 root root 13K Jan 14 20:31 genomicsdb-6.vcf.gz.tbi
drwxr-xr-x 2 root root 4.0K Jan 14 20:32 glob-330eecb06b4c0ad6b45febf0c8001b04
-rw-r--r-- 1 root root 168 Jan 14 20:32 glob-330eecb06b4c0ad6b45febf0c8001b04.list
drwxr-xr-x 2 root root 4.0K Jan 14 20:32 glob-b34dfc006a981a93d6da067cf50036fe
-rw-r--r-- 1 root root 0 Jan 14 20:32 glob-b34dfc006a981a93d6da067cf50036fe.list
drwxr-xr-x 2 root root 4.0K Jan 14 20:32 glob-ce2a0ab5d8c37a6d061c814f835853ee
-rw-r--r-- 1 root root 140 Jan 14 20:32 glob-ce2a0ab5d8c37a6d061c814f835853ee.list
-rw-r--r-- 1 root root 309 Jan 14 19:16 gvcf_path.list
-rw-r--r-- 1 root root 2 Jan 14 20:32 rc
-rw-r--r-- 1 root root 1.8K Jan 14 19:16 sample_path.map
-rw-r--r-- 1 root root 15K Jan 14 19:16 script
-rw-r--r-- 1 root root 2.1K Jan 14 19:16 script.submit
-rw-r--r-- 1 root root 332 Jan 14 20:32 stderr
-rw-r--r-- 1 root root 0 Jan 14 19:16 stderr.submit
-rw-r--r-- 1 root root 155K Jan 14 20:32 stdout
-rw-r--r-- 1 root root 24 Jan 14 19:16 stdout.submit
However, the next step, picard sort, does not pick up the vcf.gz and vcf.gz.tbi files from the execution dir but from the intermediate glob-* dirs, which look like this:
glob-330eecb06b4c0ad6b45febf0c8001b04:
total 56K
-rw-r--r-- 1 root root 277 Jan 14 20:32 cromwell_glob_control_file
-rw-r--r-- 2 root root 105 Jan 14 20:17 genomicsdb-0.vcf.gz.tbi
-rw-r--r-- 2 root root 7.6K Jan 14 20:32 genomicsdb-1.vcf.gz.tbi
-rw-r--r-- 2 root root 11K Jan 14 20:31 genomicsdb-2.vcf.gz.tbi
-rw-r--r-- 2 root root 11K Jan 14 20:30 genomicsdb-3.vcf.gz.tbi
-rw-r--r-- 2 root root 12K Jan 14 20:32 genomicsdb-4.vcf.gz.tbi
-rw-r--r-- 2 root root 746 Jan 14 20:17 genomicsdb-5.vcf.gz.tbi
-rw-r--r-- 2 root root 13K Jan 14 20:31 genomicsdb-6.vcf.gz.tbi
glob-b34dfc006a981a93d6da067cf50036fe:
total 512
-rw-r--r-- 1 root root 277 Jan 14 20:32 cromwell_glob_control_file
glob-ce2a0ab5d8c37a6d061c814f835853ee:
total 3.6M
-rw-r--r-- 1 root root 277 Jan 14 20:32 cromwell_glob_control_file
-rw-r--r-- 3 root root 5.7K Jan 14 20:17 genomicsdb-0.vcf.gz
-rw-r--r-- 3 root root 927K Jan 14 20:32 genomicsdb-1.vcf.gz
-rw-r--r-- 3 root root 554K Jan 14 20:31 genomicsdb-2.vcf.gz
-rw-r--r-- 3 root root 813K Jan 14 20:30 genomicsdb-3.vcf.gz
-rw-r--r-- 3 root root 620K Jan 14 20:32 genomicsdb-4.vcf.gz
-rw-r--r-- 3 root root 50K Jan 14 20:17 genomicsdb-5.vcf.gz
-rw-r--r-- 3 root root 673K Jan 14 20:31 genomicsdb-6.vcf.gz
As you can see, the vcf.gz and vcf.gz.tbi files are stored in different directories.
However, the next picard sort step only looks in the directory where the vcf.gz files live, which leads to the error:
Could not localize /mnt/glusterfs/genomel-cohort-cwl/cromwell-executions/cwl_temp_file_6dfd1508-a107-491d-9cc2-8984f8e84977.cwl/6dfd1508-a107-491d-9cc2-8984f8e84977/call-gatk4_cohort_genotyping/shard-0/gatk4_cohort_genotyping.cwl/09c59ed5-8631-415c-97bc-896553cd775a/call-genomel_pdc_gatk4_cohort_genotyping/execution/glob-ce2a0ab5d8c37a6d061c814f835853ee/genomicsdb-0.vcf.gz.tbi -> /mnt/glusterfs/genomel-cohort-cwl/cromwell-executions/cwl_temp_file_6dfd1508-a107-491d-9cc2-8984f8e84977.cwl/6dfd1508-a107-491d-9cc2-8984f8e84977/call-gatk4_cohort_genotyping/shard-0/gatk4_cohort_genotyping.cwl/09c59ed5-8631-415c-97bc-896553cd775a/call-picard_sortvcf/inputs/2004815296/genomicsdb-0.vcf.gz.tbi:',
u"\t/mnt/glusterfs/genomel-cohort-cwl/cromwell-executions/cwl_temp_file_6dfd1508-a107-491d-9cc2-8984f8e84977.cwl/6dfd1508-a107-491d-9cc2-8984f8e84977/call-gatk4_cohort_genotyping/shard-0/gatk4_cohort_genotyping.cwl/09c59ed5-8631-415c-97bc-896553cd775a/call-genomel_pdc_gatk4_cohort_genotyping/execution/glob-ce2a0ab5d8c37a6d061c814f835853ee/genomicsdb-0.vcf.gz.tbi doesn't exist",
u'\tFile not found /mnt/glusterfs/genomel-cohort-cwl/cromwell-executions/cwl_temp_file_6dfd1508-a107-491d-9cc2-8984f8e84977.cwl/6dfd1508-a107-491d-9cc2-8984f8e84977/call-gatk4_cohort_genotyping/shard-0/gatk4_cohort_genotyping.cwl/09c59ed5-8631-415c-97bc-896553cd775a/call-picard_sortvcf/inputs/2004815296/genomicsdb-0.vcf.gz.tbi -> /mnt/glusterfs/genomel-cohort-cwl/cromwell-executions/cwl_temp_file_6dfd1508-a107-491d-9cc2-8984f8e84977.cwl/6dfd1508-a107-491d-9cc2-8984f8e84977/call-gatk4_cohort_genotyping/shard-0/gatk4_cohort_genotyping.cwl/09c59ed5-8631-415c-97bc-896553cd775a/call-genomel_pdc_gatk4_cohort_genotyping/execution/glob-ce2a0ab5d8c37a6d061c814f835853ee/genomicsdb-0.vcf.gz.tbi',
u'\tFile not found /mnt/glusterfs/genomel-cohort-cwl/cromwell-executions/cwl_temp_file_6dfd1508-a107-491d-9cc2-8984f8e84977.cwl/6dfd1508-a107-491d-9cc2-8984f8e84977/call-gatk4_cohort_genotyping/shard-0/gatk4_cohort_genotyping.cwl/09c59ed5-8631-415c-97bc-896553cd775a/call-genomel_pdc_gatk4_cohort_genotyping/execution/glob-ce2a0ab5d8c37a6d061c814f835853ee/genomicsdb-0.vcf.gz.tbi'
It shows that glob-ce2a0ab5d8c37a6d061c814f835853ee/genomicsdb-0.vcf.gz.tbi doesn't exist, because the file lives in a different directory, glob-330eecb06b4c0ad6b45febf0c8001b04.
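The mismatch can be checked with a short script. The following is a minimal Python sketch, not part of Cromwell: the function names `missing_indexes` and `build_demo_layout` are hypothetical, and the demo recreates only the two non-empty glob-* directories from the listings above as empty files in a temporary directory. It reports every vcf.gz whose .tbi index is not in the same directory.

```python
import tempfile
from pathlib import Path


def missing_indexes(execution_dir):
    """Map each glob-* directory name to the .vcf.gz files lacking a sibling .tbi."""
    missing = {}
    for glob_dir in sorted(Path(execution_dir).glob("glob-*")):
        if not glob_dir.is_dir():
            continue  # skip the glob-*.list files that sit next to the directories
        for vcf in sorted(glob_dir.glob("*.vcf.gz")):
            # The index is expected beside the VCF, with ".tbi" appended to its name.
            tbi = vcf.with_name(vcf.name + ".tbi")
            if not tbi.exists():
                missing.setdefault(glob_dir.name, []).append(vcf.name)
    return missing


def build_demo_layout(root):
    """Recreate the directory layout from the listings (assumption: empty files suffice)."""
    vcf_dir = root / "glob-ce2a0ab5d8c37a6d061c814f835853ee"
    tbi_dir = root / "glob-330eecb06b4c0ad6b45febf0c8001b04"
    vcf_dir.mkdir()
    tbi_dir.mkdir()
    for i in range(7):
        (vcf_dir / f"genomicsdb-{i}.vcf.gz").touch()
        (tbi_dir / f"genomicsdb-{i}.vcf.gz.tbi").touch()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        build_demo_layout(root)
        for directory, names in missing_indexes(root).items():
            print(directory, "is missing indexes for:", ", ".join(names))
```

On a real run, one would pass the call's execution directory to `missing_indexes` instead of the temporary one.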
Hi, I am trying to build a VCF-related workflow that involves gatk4 and picard tools.
As an example, let's say I want to call gatk4 first to get some VCF files, and then use picard to sort them.
Suppose gatk4.cwl produces the VCF outputs listed above, and the next step, picard sort, takes them as an input array (whether that input declares secondaryFiles doesn't matter in my tests: neither variant works, and both fail with the same error).
Currently, the workaround is to declare my outputs without secondaryFiles. Or, if the output is just a single file, it works when the glob names that file specifically.