Open multimeric opened 5 years ago
I suppose this is related to running Cromwell on AWS? On our HPC cluster it puts everything in the correct places.
Sorry, I should have clarified: this is with the local backend (i.e. cromwell run xxx.wdl). The AWS backend puts files into S3 buckets, which is different again.
What's strange is that the second time I run this, it divides the inputs into a different set of folders, and it works:
$ tree cromwell-executions/trio/73300e3a-1776-4db4-8113-fb1e91ab4e8e/call-germline_variant_calling/shard-0/germline_variant_calling/ee662960-59cc-412f-be94-2c5d948d7a15/call-haplotype_caller/inputs
cromwell-executions/trio/73300e3a-1776-4db4-8113-fb1e91ab4e8e/call-germline_variant_calling/shard-0/germline_variant_calling/ee662960-59cc-412f-be94-2c5d948d7a15/call-haplotype_caller/inputs
├── -290704826
│   ├── alignment.merged.bam
│   └── alignment.merged.bam.bai
└── 379983236
    ├── cosmic_test.vcf.gz
    ├── cosmic_test.vcf.gz.tbi
    ├── exons.bed
    ├── GenomeAnalysisTK.jar
    ├── ucsc.hg19.dict
    ├── ucsc.hg19.fasta
    ├── ucsc.hg19.fasta.fai
    └── ucsc.hg19.fasta.gz
What's interesting is that the names of these input folders stay the same when the inputs stay the same, but change when the inputs change. So maybe this is some kind of caching mechanism, to do with the fact that Cromwell hardlinks files to each other? It's still a problem though, because it means Cromwell runs are non-deterministic.
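One way to get folder names that behave like this (stable for the same inputs, sometimes negative) is to name each subdirectory after a signed 32-bit hash of the source file's parent directory. The sketch below only illustrates that idea. It assumes a Java-style `String.hashCode()` over the parent path, which nothing in this thread confirms about Cromwell, and the names `java_string_hash` and `localized_path` are hypothetical.

```python
import posixpath


def java_string_hash(s: str) -> int:
    """Emulate Java's String.hashCode() for ASCII/BMP text (signed 32-bit)."""
    h = 0
    for ch in s:
        # Java computes h = 31*h + c with 32-bit overflow.
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret the unsigned value as a signed 32-bit integer.
    return h - 0x100000000 if h >= 0x80000000 else h


def localized_path(source: str, inputs_dir: str = "inputs") -> str:
    """Hypothetical scheme: inputs/<hash of parent directory>/<file name>."""
    parent, name = posixpath.split(source)
    return posixpath.join(inputs_dir, str(java_string_hash(parent)), name)


if __name__ == "__main__":
    # Files from the same source directory land in the same subdirectory,
    # files from different source directories do not.
    for src in ["/data/alignment.bam", "/data/alignment.bam.bai", "/ref/genome.fasta"]:
        print(src, "->", localized_path(src))
```

Under such a scheme the layout is a pure function of the source paths, so a BAM and its index would only end up together if they started out in the same directory.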
I've just encountered this same issue on the Google Cloud backend. I have a task that produces a bam and a bam index, and a second task that uses those two files as inputs (truncated for brevity):
task process_bam {
  output {
    File dedup_recal_bam = glob('*recal.bam')[0]
    File dedup_recal_bai = glob('*recal.bam.bai')[0]
  }
}
task bam_qc {
  input {
    File alignment
    File alignment_index
  }
}
However, because these two files were obtained using different globs in the previous task, they're put into different folders for the bam_qc task. I get the following output from the Cromwell log:
2018/11/14 23:59:23 I: Running command: sudo gsutil -q -m cp gs://genovic-cromwell/cromwell-execution/trio/c9e76c9b-3b57-4759-8fb5-ea26e87c4fe0/call-germline_variant_calling/shard-0/germline_variant_calling/01924cea-59a3-46af-a281-0ff1a72e6e8c/call-process_bam/glob-1a242f868adfdadea2979bf45a8deddc/recal.bam.bai /mnt/local-disk/genovic-cromwell/cromwell-execution/trio/c9e76c9b-3b57-4759-8fb5-ea26e87c4fe0/call-germline_variant_calling/shard-0/germline_variant_calling/01924cea-59a3-46af-a281-0ff1a72e6e8c/call-process_bam/glob-1a242f868adfdadea2979bf45a8deddc/recal.bam.bai
2018/11/14 23:59:40 I: Running command: sudo gsutil -q -m cp gs://genovic-cromwell/cromwell-execution/trio/c9e76c9b-3b57-4759-8fb5-ea26e87c4fe0/call-germline_variant_calling/shard-0/germline_variant_calling/01924cea-59a3-46af-a281-0ff1a72e6e8c/call-process_bam/glob-24e893856b331cbd7264cd189c69b969/recal.bam /mnt/local-disk/genovic-cromwell/cromwell-execution/trio/c9e76c9b-3b57-4759-8fb5-ea26e87c4fe0/call-germline_variant_calling/shard-0/germline_variant_calling/01924cea-59a3-46af-a281-0ff1a72e6e8c/call-process_bam/glob-24e893856b331cbd7264cd189c69b969/recal.bam
So ultimately what the actual script sees is two separate files in different folders, and thus it doesn't think the BAM is indexed. This is a problem!
glob-24e893856b331cbd7264cd189c69b969/recal.bam
glob-1a242f868adfdadea2979bf45a8deddc/recal.bam.bai
In fact, I think this is the crux of the problem. If you have two different globs for a file and its index, then they'll be put into different directories in the next task they're used for, and thus the task will probably fail. I don't think this is desired behaviour.
Why do "inputs" directories have multiple sub-directories? Why do they have ANY sub-directories? Why doesn't Cromwell simply put all the input files in the inputs directory?
Possibly because it allows you to handle files with duplicate filenames?
Then put everything in one directory unless there's a file name collision, in which case you start a second directory.
At the time it's creating the directories, it knows what all the file names are, no?
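The collision-based layout suggested here can be sketched in a few lines. This is only an illustration of the proposal, not how Cromwell behaves. The function name `assign_input_dirs` and the numbered `inputs/0`, `inputs/1` directories are made up for the example.

```python
import posixpath


def assign_input_dirs(sources):
    """Place each file in the lowest-numbered directory that has no file
    with the same name yet; open a new directory only on a collision."""
    used = []  # used[n] holds the file names already placed in directory n
    placement = {}
    for src in sources:
        name = posixpath.basename(src)
        for n, names in enumerate(used):
            if name not in names:
                break
        else:
            # Every existing directory already contains this name.
            used.append(set())
            n = len(used) - 1
        used[n].add(name)
        placement[src] = posixpath.join("inputs", str(n), name)
    return placement


if __name__ == "__main__":
    layout = assign_input_dirs([
        "/run/glob-a/recal.bam",
        "/run/glob-b/recal.bam.bai",
        "/other/recal.bam",  # same name as the first file
    ])
    for src, dest in layout.items():
        print(src, "->", dest)
```

With this scheme a BAM and its index share a directory whenever their names differ, and a second directory appears only for the duplicate `recal.bam`.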
So, our solution for this problem was to turn the execution directory into an input directory, and make aliases to each of our files in there.
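A minimal sketch of that workaround, assuming "aliases" means symbolic links and that the tool only needs the BAM and its index to appear side by side. The helper name `colocate` is hypothetical.

```python
import os


def colocate(paths, dest="."):
    """Symlink every path into dest under its own file name and return
    the new paths, so a tool sees all the files in one directory."""
    os.makedirs(dest, exist_ok=True)
    linked = []
    for path in paths:
        target = os.path.join(dest, os.path.basename(path))
        if os.path.lexists(target):
            # Refuse to silently overwrite a file with the same name.
            raise FileExistsError(f"{target} already exists")
        os.symlink(os.path.abspath(path), target)
        linked.append(target)
    return linked
```

Inside a task, the same effect could be had by running `ln -s` on each input at the top of the command block and then pointing the tool at the linked BAM rather than the original path.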
I'm attempting to run a HaplotypeCaller job that requires that the BAM and BAI file are in the same directory. However, for some reason Cromwell is putting the inputs into separate subdirectories. For example:
Thus, I get the error:
##### ERROR MESSAGE: Invalid command line: Cannot process the provided BAM/CRAM file(s) because they were not indexed.
The relevant parts of my WDL (simplified for this example) are:
How do I stop Cromwell from doing this? Is it possible to force all inputs to go into the same directory?