broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Malformed VCF headers in HaplotypeCallerSpark #4821

Open rdocking opened 6 years ago

rdocking commented 6 years ago

Hi there - I'm having some problems running HaplotypeCallerSpark on RNA-Seq data.

The tl;dr is that, on some occasions when HaplotypeCallerSpark runs out of memory, it still finishes successfully but writes out a VCF file without a proper header.

Example command syntax is:

gatk-launch \
--java-options '-Xms800m -Xmx94349m -Djava.io.tmpdir=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicate_one_small/debug/' \
HaplotypeCallerSpark \
--reference /projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19/ucsc/hg19.2bit \
--annotation MappingQualityRankSumTest --annotation MappingQualityZero \
--annotation QualByDepth --annotation ReadPosRankSumTest \
--annotation RMSMappingQuality --annotation BaseQualityRankSumTest \
--annotation FisherStrand --annotation MappingQuality \
--annotation DepthPerAlleleBySample --annotation Coverage \
-I /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicate_one_small/work/align/MOLM13_rep1/MOLM13_rep1-dedup.splitN.bam \
-L /projects/karsanlab/rdocking/KARSANBIO-1254_pipeline/KARSANBIO-1390_rna_seq_runs/data/gatk_debug/chr1_70k.bed \
--interval-set-rule INTERSECTION \
--spark-master local[12] \
--conf spark.local.dir=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicate_one_small/debug \
--conf spark.driver.host=localhost \
--conf spark.network.timeout=800 \
--conf spark.executor.heartbeatInterval=100 \
--annotation ClippingRankSumTest --annotation DepthPerSampleHC \
--emit-ref-confidence GVCF -GQB 10 -GQB 20 -GQB 30 -GQB 40 -GQB 60 -GQB 80 \
--output MOLM13_rep1-chr1-70k-gatk-haplotype.vcf

When I run this command on a single chromosome with -Xmx94349m, the command completes successfully, but the resulting VCF header does not contain this expected header line:

##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">

Most of the other header lines associated with GVCF output are missing as well. When I raise the memory request to 110g for the same input files, the proper VCF header is present.
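For anyone who wants to screen their own output for this symptom, here is a minimal sketch of a header check in Python. The function names and the list of expected prefixes are hypothetical and not part of GATK. Only the END line comes from this report, so extend the list with whatever other header lines your pipeline relies on.

```python
import gzip
from typing import Iterable, List

# Header prefixes that a GVCF-mode run is expected to emit.
# Assumption: only the END line is taken from this report; add others as needed.
EXPECTED_PREFIXES = ["##INFO=<ID=END,"]


def missing_header_lines(lines: Iterable[str],
                         expected: List[str] = EXPECTED_PREFIXES) -> List[str]:
    """Return the expected prefixes that never appear among the '##' meta lines."""
    remaining = list(expected)
    for line in lines:
        if not line.startswith("##"):
            break  # meta-information lines end at the '#CHROM' column header
        remaining = [p for p in remaining if not line.startswith(p)]
    return remaining


def check_vcf(path: str) -> List[str]:
    """Open a plain or gzip/bgzip-compressed VCF and report missing header prefixes."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return missing_header_lines(handle)
```

Calling check_vcf("MOLM13_rep1-chr1-70k-gatk-haplotype.vcf") and getting back a non-empty list would flag the malformed case described here. An empty list only means the listed prefixes are present, not that the file is otherwise valid.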

I discovered this while running GATK within the bcbio pipeline. The original description is at: https://github.com/bcbio/bcbio-nextgen/issues/2375

On the linked issue, I have examples of GATK output from runs that produced correct and incorrect output - please let me know if there's any other information you need. Thanks!

lbergelson commented 6 years ago

@rdocking Very strange. Did it produce any stack trace in the output? It sounds like maybe executors are dying and something isn't retrying correctly in some way.

rdocking commented 6 years ago

@lbergelson - I didn't see any stack trace in the output. Here are examples from similar samples:

gatk_debug_60k.txt - Runs properly
gatk_debug_70k.txt - Malformed output