broadinstitute / cromwell

Scientific workflow engine designed for simplicity and scalability. Trivially transition from one-off use cases to massive-scale production environments.
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Cromwell placing inputs in multiple subdirectories #4361

Open multimeric opened 5 years ago

multimeric commented 5 years ago

I'm attempting to run a HaplotypeCaller job that requires that the BAM and BAI file are in the same directory. However, for some reason Cromwell is putting the inputs into separate subdirectories. For example:

$ tree cromwell-executions/trio/5cb01e4f-98a2-41d8-8946-a84e4e09291f/call-germline_variant_calling/shard-0/germline_variant_calling/5f7e982d-8f5d-4db4-bb08-9e47532da496/call-haplotype_caller/inputs
cromwell-executions/trio/5cb01e4f-98a2-41d8-8946-a84e4e09291f/call-germline_variant_calling/shard-0/germline_variant_calling/5f7e982d-8f5d-4db4-bb08-9e47532da496/call-haplotype_caller/inputs
├── 1155873852
│   └── recal.bam.bai
├── -170302265
│   └── recal.bam
└── 379983236
    ├── cosmic_test.vcf.gz
    ├── cosmic_test.vcf.gz.tbi
    ├── exons.bed
    ├── GenomeAnalysisTK.jar
    ├── ucsc.hg19.dict
    ├── ucsc.hg19.fasta
    ├── ucsc.hg19.fasta.fai
    └── ucsc.hg19.fasta.gz

Thus, I get the error: ##### ERROR MESSAGE: Invalid command line: Cannot process the provided BAM/CRAM file(s) because they were not indexed.

The relevant parts of my WDL (simplified for this example) are:

task process_bam {
    input {
        File bam
        File bai
        File gatk
        File reference
        Array[File] reference_indices
        Array[File] realigner_knowns
        Array[File] realigner_known_indices
        Array[File] bqsr_knowns
        Array[File] bqsr_known_indices
        File intervals
    }

    command {
        OUTPUT_DIR=`pwd`
        cd /app
        /app/process_bam_docker.py \
        --bam "${bam}" \
        --bai "${bai}" \
        --gatk "${gatk}" \
        --ref "${reference}" \
        ${sep=" " prefix("--realigner-known ", realigner_knowns)} \
        ${sep=" " prefix("--bqsr-known ", bqsr_knowns)} \
        --intervals  "${intervals}" \
        --indel-realigner \
        --output-dir "$OUTPUT_DIR"
    }

    runtime {
        docker: "988908462339.dkr.ecr.ap-southeast-2.amazonaws.com/dx_process_bam:latest"
    }

    output {
        File dedup_bam = glob('*dedup.bam')[0]
        File dedup_matrix = glob('*dedup.metrics')[0]
        File dedup_recal_bam = glob('*recal.bam')[0]
        File dedup_recal_bai = glob('*recal.bam.bai')[0]
        File dedup_recal_counts = glob('*recal.counts')[0]
    }
}

task haplotype_caller {
    input {
        File reference
        File gatk
        Array[File] reference_indices
        Int interval_padding
        File intervals
        File alignment
        File alignment_index
        File dbsnp
        File dbsnp_index
    }

    command {
        /app/docker.py \
        --genome-gz "${reference}" \
        java -jar "${gatk}" \
        --analysis_type HaplotypeCaller \
        --emitRefConfidence GVCF \
        --annotation AlleleBalance \
        --annotation GCContent \
        --annotation GenotypeSummaries \
        --annotation LikelihoodRankSumTest \
        --annotation StrandBiasBySample \
        --annotation VariantType \
        --logging_level INFO \
        --interval_padding "${interval_padding}" \
        --intervals "${intervals}" \
        --input_file "${alignment}" \
        --dbsnp "${dbsnp}" \
        --out .
    }

    runtime {
        docker: "988908462339.dkr.ecr.ap-southeast-2.amazonaws.com/dx_variant_calling:latest"
    }

    output {
        File vcf = glob("*.vcf")[0]
        File tbi = glob("*.tbi")[0]
    }
}

workflow germline_variant_calling {
    input {
        File gatk

        # dbsnp
        File dbsnp
        File dbsnp_index

        # Regions
        File intervals
        Int interval_padding

        # Reference
        Array[File] gatk_reference_indices
        File reference

        # BAM processing
        Array[File] realigner_knowns
        Array[File] realigner_known_indices
        Array[File] bqsr_knowns
        Array[File] bqsr_known_indices
    }

    call process_bam {
        input:
            gatk = gatk,
            bam = merge_bam.alignment,
            bai = merge_bam.alignment_index,
            intervals = intervals,
            reference_indices = gatk_reference_indices,
            reference = reference,
            realigner_knowns = realigner_knowns,
            realigner_known_indices = realigner_known_indices,
            bqsr_knowns = bqsr_knowns,
            bqsr_known_indices = bqsr_known_indices,
    }

    call haplotype_caller {
        input:
            reference = reference,
            gatk = gatk,
            reference_indices = gatk_reference_indices,
            intervals = intervals,
            alignment = process_bam.dedup_recal_bam,
            alignment_index = process_bam.dedup_recal_bai,
            dbsnp = dbsnp,
            dbsnp_index = dbsnp_index,
            interval_padding = interval_padding
    }

    output {
        File gvcf = haplotype_caller.vcf
        File tbi = haplotype_caller.tbi
    }
}

How do I stop Cromwell from doing this? Is it possible to force all inputs to go into the same directory?
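
One workaround that is commonly used with GATK-style tools, sketched below as a minimal variant of the haplotype_caller task above (annotation and logging flags omitted for brevity, and assuming the container allows creating symlinks in the working directory), is to link the BAM and its index into the working directory under matching names before the tool runs, so they always sit side by side regardless of which inputs subdirectories Cromwell picked:

task haplotype_caller {
    input {
        File reference
        File gatk
        Array[File] reference_indices
        Int interval_padding
        File intervals
        File alignment
        File alignment_index
        File dbsnp
        File dbsnp_index
    }

    command {
        # Place the BAM and its index side by side in the working directory,
        # whatever inputs subdirectories Cromwell localized them into.
        ln -s "${alignment}" input.bam
        ln -s "${alignment_index}" input.bam.bai

        /app/docker.py \
        --genome-gz "${reference}" \
        java -jar "${gatk}" \
        --analysis_type HaplotypeCaller \
        --emitRefConfidence GVCF \
        --interval_padding "${interval_padding}" \
        --intervals "${intervals}" \
        --input_file input.bam \
        --dbsnp "${dbsnp}" \
        --out .
    }

    runtime {
        docker: "988908462339.dkr.ecr.ap-southeast-2.amazonaws.com/dx_variant_calling:latest"
    }

    output {
        File vcf = glob("*.vcf")[0]
        File tbi = glob("*.tbi")[0]
    }
}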

TimurIs commented 5 years ago

I suppose that is related to running Cromwell on AWS? On the HPC cluster it puts everything in the correct places.

multimeric commented 5 years ago

Sorry, I should have clarified: this is with the local backend (i.e. cromwell run xxx.wdl). In the AWS backend it puts files into S3 buckets, which is a different situation again.

multimeric commented 5 years ago

What's strange is that the second time I run this, it divides the inputs into a different set of folders, and it works:

$ tree cromwell-executions/trio/73300e3a-1776-4db4-8113-fb1e91ab4e8e/call-germline_variant_calling/shard-0/germline_variant_calling/ee662960-59cc-412f-be94-2c5d948d7a15/call-haplotype_caller/inputs             
cromwell-executions/trio/73300e3a-1776-4db4-8113-fb1e91ab4e8e/call-germline_variant_calling/shard-0/germline_variant_calling/ee662960-59cc-412f-be94-2c5d948d7a15/call-haplotype_caller/inputs
├── -290704826
│   ├── alignment.merged.bam
│   └── alignment.merged.bam.bai
└── 379983236
    ├── cosmic_test.vcf.gz
    ├── cosmic_test.vcf.gz.tbi
    ├── exons.bed
    ├── GenomeAnalysisTK.jar
    ├── ucsc.hg19.dict
    ├── ucsc.hg19.fasta
    ├── ucsc.hg19.fasta.fai
    └── ucsc.hg19.fasta.gz

What's interesting is that the names of these input folders stay the same when the inputs stay the same, but change when the inputs change. So maybe this is some kind of caching mechanism, to do with the fact that Cromwell hard-links input files? It's still a problem though, because it means Cromwell runs are non-deterministic.

multimeric commented 5 years ago

I've just encountered this same issue on the Google Cloud backend. I have a task that produces a bam and a bam index, and a second task that uses those two files as inputs (truncated for brevity):

task process_bam {
    output {
        File dedup_recal_bam = glob('*recal.bam')[0]
        File dedup_recal_bai = glob('*recal.bam.bai')[0]
    }
}

task bam_qc {
    input {
        File alignment
        File alignment_index
    }
}

However, because these two files were obtained using different globs in the previous task, they're put into different folders for the bam_qc task. I get the following output from the Cromwell log:

2018/11/14 23:59:23 I: Running command: sudo gsutil -q -m cp gs://genovic-cromwell/cromwell-execution/trio/c9e76c9b-3b57-4759-8fb5-ea26e87c4fe0/call-germline_variant_calling/shard-0/germline_variant_calling/01924cea-59a3-46af-a281-0ff1a72e6e8c/call-process_bam/glob-1a242f868adfdadea2979bf45a8deddc/recal.bam.bai /mnt/local-disk/genovic-cromwell/cromwell-execution/trio/c9e76c9b-3b57-4759-8fb5-ea26e87c4fe0/call-germline_variant_calling/shard-0/germline_variant_calling/01924cea-59a3-46af-a281-0ff1a72e6e8c/call-process_bam/glob-1a242f868adfdadea2979bf45a8deddc/recal.bam.bai
2018/11/14 23:59:40 I: Running command: sudo gsutil -q -m cp gs://genovic-cromwell/cromwell-execution/trio/c9e76c9b-3b57-4759-8fb5-ea26e87c4fe0/call-germline_variant_calling/shard-0/germline_variant_calling/01924cea-59a3-46af-a281-0ff1a72e6e8c/call-process_bam/glob-24e893856b331cbd7264cd189c69b969/recal.bam /mnt/local-disk/genovic-cromwell/cromwell-execution/trio/c9e76c9b-3b57-4759-8fb5-ea26e87c4fe0/call-germline_variant_calling/shard-0/germline_variant_calling/01924cea-59a3-46af-a281-0ff1a72e6e8c/call-process_bam/glob-24e893856b331cbd7264cd189c69b969/recal.bam

So ultimately what the actual script sees is two separate files in different folders, and thus it doesn't think the BAM is indexed. This is a problem!

glob-24e893856b331cbd7264cd189c69b969/recal.bam
glob-1a242f868adfdadea2979bf45a8deddc/recal.bam.bai
multimeric commented 5 years ago

In fact, I think this is the crux of the problem. If you have two different globs for a file and its index, then they'll be put into different directories in the next task that uses them, and the task will probably fail. I think this is not desired behaviour.
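
One way to sidestep that, sketched below under the assumptions that nothing else in the task's output directory matches the *recal.bam* pattern and that glob results keep shell (lexicographic) ordering, is to capture the BAM and its index with a single glob in the producing task, so both files land in the same glob directory and, by the behaviour described above, in the same inputs subdirectory of the consuming task:

task process_bam {
    # command section unchanged from the full task above

    output {
        # One glob that matches both recal.bam and recal.bam.bai, so Cromwell
        # stores them in the same glob-* directory.
        Array[File] dedup_recal_bam_and_index = glob('*recal.bam*')
        # With lexicographic ordering, the .bam sorts before the .bam.bai.
        File dedup_recal_bam = dedup_recal_bam_and_index[0]
        File dedup_recal_bai = dedup_recal_bam_and_index[1]
    }
}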

GregTD42 commented 3 years ago

Why do "inputs" directories have multiple sub-directories? Why do they have ANY sub-directories? Why doesn't Comwell simply put all the input files in the inputs directory?

multimeric commented 3 years ago

Possibly because it allows you to handle files with duplicate filenames?

GregTD42 commented 3 years ago

Then put everything in one directory unless there's a filename collision, at which point you start a second directory.

At the time it's creating the directories, it knows what all the file names are, no?

GregTD42 commented 3 years ago

So, our solution to this problem was to turn the execution directory into an input directory and make aliases to each of our files in there.
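
For illustration, a minimal sketch of that approach based on the bam_qc task above, assuming the task may create symlinks in its working directory and using samtools flagstat purely as a stand-in for the real QC command:

task bam_qc {
    input {
        File alignment
        File alignment_index
    }

    command {
        # Alias every input into the execution (working) directory, so the tool
        # sees a single flat directory no matter how Cromwell laid out inputs/.
        ln -s "${alignment}" "${alignment_index}" .

        # Run the tool against the local aliases rather than the original
        # Cromwell-managed paths.
        samtools flagstat "$(basename "${alignment}")" > flagstat.txt
    }

    output {
        File flagstat = "flagstat.txt"
    }
}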