broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition between one off use cases to massive scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

CWL: outputs secondaryFiles could not be found when using asterisk in glob #4546

Closed Shenglai closed 1 year ago

Shenglai commented 5 years ago

Hi, I'm trying to build a VCF-related workflow that involves gatk4 and picard tools.

As an example, let's say I want to call gatk4 first to get some VCF files, and then use picard to sort them.

If I have the gatk4.cwl output defined as

outputs:
  vcf_list:
    type: File[]
    outputBinding:
      glob: '*.vcf.gz'
    secondaryFiles: [.tbi]

and the next step, picard sort, takes an input array (from my tests, it doesn't matter whether secondaryFiles is declared here or not; neither works, and both give the same error)

inputs:
  vcf:
    type:
      type: array
      items: File
      inputBinding:
        prefix: I=
        separate: false

After gatk4 finishes, the execution dir will look like

drwx------ 3 root root 4.0K Jan 14 19:16 genomicsdb-0
-rw-r--r-- 3 root root 5.7K Jan 14 20:17 genomicsdb-0.vcf.gz
-rw-r--r-- 2 root root  105 Jan 14 20:17 genomicsdb-0.vcf.gz.tbi
drwx------ 3 root root 4.0K Jan 14 19:17 genomicsdb-1
-rw-r--r-- 3 root root 927K Jan 14 20:32 genomicsdb-1.vcf.gz
-rw-r--r-- 2 root root 7.6K Jan 14 20:32 genomicsdb-1.vcf.gz.tbi
drwx------ 3 root root 4.0K Jan 14 19:29 genomicsdb-2
-rw-r--r-- 3 root root 554K Jan 14 20:31 genomicsdb-2.vcf.gz
-rw-r--r-- 2 root root  11K Jan 14 20:31 genomicsdb-2.vcf.gz.tbi
drwx------ 3 root root 4.0K Jan 14 19:41 genomicsdb-3
-rw-r--r-- 3 root root 813K Jan 14 20:30 genomicsdb-3.vcf.gz
-rw-r--r-- 2 root root  11K Jan 14 20:30 genomicsdb-3.vcf.gz.tbi
drwx------ 3 root root 4.0K Jan 14 19:52 genomicsdb-4
-rw-r--r-- 3 root root 620K Jan 14 20:32 genomicsdb-4.vcf.gz
-rw-r--r-- 2 root root  12K Jan 14 20:32 genomicsdb-4.vcf.gz.tbi
drwx------ 3 root root 4.0K Jan 14 20:04 genomicsdb-5
-rw-r--r-- 3 root root  50K Jan 14 20:17 genomicsdb-5.vcf.gz
-rw-r--r-- 2 root root  746 Jan 14 20:17 genomicsdb-5.vcf.gz.tbi
drwx------ 3 root root 4.0K Jan 14 20:05 genomicsdb-6
-rw-r--r-- 3 root root 673K Jan 14 20:31 genomicsdb-6.vcf.gz
-rw-r--r-- 2 root root  13K Jan 14 20:31 genomicsdb-6.vcf.gz.tbi
drwxr-xr-x 2 root root 4.0K Jan 14 20:32 glob-330eecb06b4c0ad6b45febf0c8001b04
-rw-r--r-- 1 root root  168 Jan 14 20:32 glob-330eecb06b4c0ad6b45febf0c8001b04.list
drwxr-xr-x 2 root root 4.0K Jan 14 20:32 glob-b34dfc006a981a93d6da067cf50036fe
-rw-r--r-- 1 root root    0 Jan 14 20:32 glob-b34dfc006a981a93d6da067cf50036fe.list
drwxr-xr-x 2 root root 4.0K Jan 14 20:32 glob-ce2a0ab5d8c37a6d061c814f835853ee
-rw-r--r-- 1 root root  140 Jan 14 20:32 glob-ce2a0ab5d8c37a6d061c814f835853ee.list
-rw-r--r-- 1 root root  309 Jan 14 19:16 gvcf_path.list
-rw-r--r-- 1 root root    2 Jan 14 20:32 rc
-rw-r--r-- 1 root root 1.8K Jan 14 19:16 sample_path.map
-rw-r--r-- 1 root root  15K Jan 14 19:16 script
-rw-r--r-- 1 root root 2.1K Jan 14 19:16 script.submit
-rw-r--r-- 1 root root  332 Jan 14 20:32 stderr
-rw-r--r-- 1 root root    0 Jan 14 19:16 stderr.submit
-rw-r--r-- 1 root root 155K Jan 14 20:32 stdout
-rw-r--r-- 1 root root   24 Jan 14 19:16 stdout.submit

However, the next step, picard sort, does not pick up the vcf.gz and vcf.gz.tbi files from the execution dir, but from the intermediate glob-* dirs, which look like

glob-330eecb06b4c0ad6b45febf0c8001b04:
total 56K
-rw-r--r-- 1 root root  277 Jan 14 20:32 cromwell_glob_control_file
-rw-r--r-- 2 root root  105 Jan 14 20:17 genomicsdb-0.vcf.gz.tbi
-rw-r--r-- 2 root root 7.6K Jan 14 20:32 genomicsdb-1.vcf.gz.tbi
-rw-r--r-- 2 root root  11K Jan 14 20:31 genomicsdb-2.vcf.gz.tbi
-rw-r--r-- 2 root root  11K Jan 14 20:30 genomicsdb-3.vcf.gz.tbi
-rw-r--r-- 2 root root  12K Jan 14 20:32 genomicsdb-4.vcf.gz.tbi
-rw-r--r-- 2 root root  746 Jan 14 20:17 genomicsdb-5.vcf.gz.tbi
-rw-r--r-- 2 root root  13K Jan 14 20:31 genomicsdb-6.vcf.gz.tbi

glob-b34dfc006a981a93d6da067cf50036fe:
total 512
-rw-r--r-- 1 root root 277 Jan 14 20:32 cromwell_glob_control_file

glob-ce2a0ab5d8c37a6d061c814f835853ee:
total 3.6M
-rw-r--r-- 1 root root  277 Jan 14 20:32 cromwell_glob_control_file
-rw-r--r-- 3 root root 5.7K Jan 14 20:17 genomicsdb-0.vcf.gz
-rw-r--r-- 3 root root 927K Jan 14 20:32 genomicsdb-1.vcf.gz
-rw-r--r-- 3 root root 554K Jan 14 20:31 genomicsdb-2.vcf.gz
-rw-r--r-- 3 root root 813K Jan 14 20:30 genomicsdb-3.vcf.gz
-rw-r--r-- 3 root root 620K Jan 14 20:32 genomicsdb-4.vcf.gz
-rw-r--r-- 3 root root  50K Jan 14 20:17 genomicsdb-5.vcf.gz
-rw-r--r-- 3 root root 673K Jan 14 20:31 genomicsdb-6.vcf.gz

As you can see, the vcf.gz and vcf.gz.tbi files are stored under different directories. However, the picard sort step only looks in the directory where the vcf.gz files live, which leads to the error:

Could not localize /mnt/glusterfs/genomel-cohort-cwl/cromwell-executions/cwl_temp_file_6dfd1508-a107-491d-9cc2-8984f8e84977.cwl/6dfd1508-a107-491d-9cc2-8984f8e84977/call-gatk4_cohort_genotyping/shard-0/gatk4_cohort_genotyping.cwl/09c59ed5-8631-415c-97bc-896553cd775a/call-genomel_pdc_gatk4_cohort_genotyping/execution/glob-ce2a0ab5d8c37a6d061c814f835853ee/genomicsdb-0.vcf.gz.tbi -> /mnt/glusterfs/genomel-cohort-cwl/cromwell-executions/cwl_temp_file_6dfd1508-a107-491d-9cc2-8984f8e84977.cwl/6dfd1508-a107-491d-9cc2-8984f8e84977/call-gatk4_cohort_genotyping/shard-0/gatk4_cohort_genotyping.cwl/09c59ed5-8631-415c-97bc-896553cd775a/call-picard_sortvcf/inputs/2004815296/genomicsdb-0.vcf.gz.tbi:',
 u"\t/mnt/glusterfs/genomel-cohort-cwl/cromwell-executions/cwl_temp_file_6dfd1508-a107-491d-9cc2-8984f8e84977.cwl/6dfd1508-a107-491d-9cc2-8984f8e84977/call-gatk4_cohort_genotyping/shard-0/gatk4_cohort_genotyping.cwl/09c59ed5-8631-415c-97bc-896553cd775a/call-genomel_pdc_gatk4_cohort_genotyping/execution/glob-ce2a0ab5d8c37a6d061c814f835853ee/genomicsdb-0.vcf.gz.tbi doesn't exist",
 u'\tFile not found /mnt/glusterfs/genomel-cohort-cwl/cromwell-executions/cwl_temp_file_6dfd1508-a107-491d-9cc2-8984f8e84977.cwl/6dfd1508-a107-491d-9cc2-8984f8e84977/call-gatk4_cohort_genotyping/shard-0/gatk4_cohort_genotyping.cwl/09c59ed5-8631-415c-97bc-896553cd775a/call-picard_sortvcf/inputs/2004815296/genomicsdb-0.vcf.gz.tbi -> /mnt/glusterfs/genomel-cohort-cwl/cromwell-executions/cwl_temp_file_6dfd1508-a107-491d-9cc2-8984f8e84977.cwl/6dfd1508-a107-491d-9cc2-8984f8e84977/call-gatk4_cohort_genotyping/shard-0/gatk4_cohort_genotyping.cwl/09c59ed5-8631-415c-97bc-896553cd775a/call-genomel_pdc_gatk4_cohort_genotyping/execution/glob-ce2a0ab5d8c37a6d061c814f835853ee/genomicsdb-0.vcf.gz.tbi',
 u'\tFile not found /mnt/glusterfs/genomel-cohort-cwl/cromwell-executions/cwl_temp_file_6dfd1508-a107-491d-9cc2-8984f8e84977.cwl/6dfd1508-a107-491d-9cc2-8984f8e84977/call-gatk4_cohort_genotyping/shard-0/gatk4_cohort_genotyping.cwl/09c59ed5-8631-415c-97bc-896553cd775a/call-genomel_pdc_gatk4_cohort_genotyping/execution/glob-ce2a0ab5d8c37a6d061c814f835853ee/genomicsdb-0.vcf.gz.tbi'

It shows that glob-ce2a0ab5d8c37a6d061c814f835853ee/genomicsdb-0.vcf.gz.tbi doesn't exist, because the file lives under a different directory, glob-330eecb06b4c0ad6b45febf0c8001b04.
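To make the mismatch concrete, here is a minimal Python sketch of the CWL secondaryFiles pattern rule (append the suffix to the primary file's path; each leading `^` strips one extension first). It assumes, based only on the error message above, that the secondary file is looked up next to the primary file's localized path. The function name `secondary_path` is hypothetical and is not part of Cromwell, and the paths are shortened to their last components.

```python
from pathlib import PurePosixPath


def secondary_path(primary: str, pattern: str) -> str:
    """Apply a CWL secondaryFiles pattern to a primary file path.

    Each leading '^' removes one extension from the primary path;
    whatever remains of the pattern is appended to the result.
    """
    path = primary
    while pattern.startswith("^"):
        # with_suffix("") drops only the last extension of the file name
        path = str(PurePosixPath(path).with_suffix(""))
        pattern = pattern[1:]
    return path + pattern


# Shortened paths taken from the listings above (assumption: only the
# glob directory matters for the comparison).
primary = "execution/glob-ce2a0ab5d8c37a6d061c814f835853ee/genomicsdb-0.vcf.gz"
actual_tbi = "execution/glob-330eecb06b4c0ad6b45febf0c8001b04/genomicsdb-0.vcf.gz.tbi"

expected_tbi = secondary_path(primary, ".tbi")
print(expected_tbi)                # the path the error reports as missing
print(expected_tbi == actual_tbi)  # the index really sits in another glob dir
```

The derived path ends in glob-ce2a0ab5d8c37a6d061c814f835853ee/genomicsdb-0.vcf.gz.tbi, which matches the path in the "doesn't exist" message, while the real index file is in the other glob directory.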

Currently, my workaround is to declare my outputs without secondaryFiles, such as:

outputs:
  vcf_list:
    type: File[]
    outputBinding:
      glob: '*.vcf.gz'

Or, if it's just a single file, it works when the glob names the file specifically, such as:

outputs:
  sorted_vcf:
    type: File
    outputBinding:
      glob: $(inputs.job_uuid + '.' + inputs.output_prefix + '.vcf.gz')
    secondaryFiles: [.tbi]

multimeric commented 5 years ago

I think this might be related to my issue, #4361, which is the same problem but with WDL.

geoffjentry commented 5 years ago

We're going to look at #4361 first and, once that is resolved, see whether this is indeed the same underlying issue.