Closed hliu2016 closed 6 years ago
Thanks, sorry about that. Could you pass on the VCF file that has the issue so I can take a look at it?
Is this related to #2342 at all?
Hi guys - I'm running into this issue now using the RNA-Seq pipeline (see related discussion on #2390).
Here's what my setup looks like:
details:
- algorithm:
adapters:
- truseq
- illumina
aligner: star
exclude_regions:
- polyx
- lcr
- highdepth
- altcontigs
expression_caller:
- salmon
mark_duplicates: true
strandedness: firststrand
transcriptome_align: false
trim_reads: atropos
variantcaller:
- gatk-haplotype
analysis: RNA-seq
description: MOLM13_rep1
files:
- /projects/karsanlab/rdocking/KARSANBIO-1254_pipeline/KARSANBIO-1390_rna_seq_runs/data/A34048_1.fq.gz
- /projects/karsanlab/rdocking/KARSANBIO-1254_pipeline/KARSANBIO-1390_rna_seq_runs/data/A34048_2.fq.gz
genome_build: hg19
metadata:
batch: MOLM13_rep1_batch
(I don't usually run trimming, but I was trying it to see if it helped with the memory issues from the other ticket). In this case, the run passes through the first-stage of GATK variant calling, then crashes like this:
[May 12, 2018 1:37:16 AM PDT] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=811073536
java.lang.IllegalStateException: Key END found in VariantContext field INFO at chr4:48991 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default.
at htsjdk.variant.vcf.VCFEncoder.fieldIsMissingFromHeaderError(VCFEncoder.java:173)
at htsjdk.variant.vcf.VCFEncoder.encode(VCFEncoder.java:112)
at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:224)
at com.intel.genomicsdb.GenomicsDBImporter.add(GenomicsDBImporter.java:1368)
at com.intel.genomicsdb.GenomicsDBImporter.importBatch(GenomicsDBImporter.java:1418)
at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.traverse(GenomicsDBImport.java:509)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:893)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:134)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
[
The relevant commands run were:
SplitN:
[2018-05-11T23:00Z] eccfx01n1.hpc.bcgsc.ca: unset JAVA_HOME && export PATH=/projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/bin:$PATH && gatk-launch --java-options '-Xms727m -Xmx3272m -XX:+UseSerialGC -Djava.io.tmpdir=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmpC8Uo9D' SplitNCigarReads -R /projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19/seq/hg19.fa -I /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/align/MOLM13_rep3/MOLM13_rep3-dedup.bam --output /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmpJw9pVK/MOLM13_rep3-dedup.splitN.bam
HaplotypeCaller:
[2018-05-12T02:16Z] eccfx01n2: export SPARK_USER=rdocking && unset JAVA_HOME && export PATH=/projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/bin:$PATH && gatk-launch --java-options '-Xms800m -Xmx94349m -Djava.io.tmpdir=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmp617YSx' HaplotypeCallerSpark --reference /projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19/ucsc/hg19.2bit --annotation MappingQualityRankSumTest --annotation MappingQualityZero --annotation QualByDepth --annotation ReadPosRankSumTest --annotation RMSMappingQuality --annotation BaseQualityRankSumTest --annotation FisherStrand --annotation MappingQuality --annotation DepthPerAlleleBySample --annotation Coverage -I /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/align/MOLM13_rep2/MOLM13_rep2-dedup.splitN.bam -L /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bedprep/ref-transcripts-rnaseq_clean-glimit-nolcr-nopolyx.bed --interval-set-rule INTERSECTION --spark-master local[32] --conf spark.local.dir=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmp617YSx --annotation ClippingRankSumTest --annotation DepthPerSampleHC --emit-ref-confidence GVCF -GQB 10 -GQB 20 -GQB 30 -GQB 40 -GQB 60 -GQB 80 --output /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmp617YSx/MOLM13_rep2-gatk-haplotype.vcf
GenomicsDBImport:
[2018-05-12T08:43Z] eccfx01n3: unset JAVA_HOME && export PATH=/projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/bin:$PATH && gatk-launch --java-options '-Xms800m -Xmx94349m -XX:+UseSerialGC -Djava.io.tmpdir=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/joint/gatk-haplotype-joint/MOLM13_rep1_batch/chrM/bcbiotx/tmpbFjdnm' GenomicsDBImport --reader-threads 32 --genomicsdb-workspace-path MOLM13_rep1_batch-chrM_0_16571_genomicsdb -L chrM:1-16571 --variant /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/variation/rnaseq/gatk-haplotype/MOLM13_rep1-gatk-haplotype.vcf.gz
GenotypeGVCFs:
[2018-05-12T08:43Z] eccfx01n3: unset JAVA_HOME && export PATH=/projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/bin:$PATH && gatk-launch --java-options '-Xms800m -Xmx94349m -XX:+UseSerialGC -Djava.io.tmpdir=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmpjJJBx4' GenotypeGVCFs --variant gendb:///projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/joint/gatk-haplotype-joint/MOLM13_rep1_batch/chrM/MOLM13_rep1_batch-chrM_0_16571_genomicsdb -R /projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19/seq/hg19.fa --output /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmpoTr60e/MOLM13_rep1_batch-chrM_0_16571.vcf.gz -L chrM:1-16571 --use-new-qual-calculator
The actual command that errored out was this one:
[0:apply]: CalledProcessError: Command 'set -o pipefail; unset JAVA_HOME && export PATH=/projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/bin:$PATH && gatk-launch --java-options '-Xms800m -Xmx94349m -XX:+UseSerialGC -Djava.io.tmpdir=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/joint/gatk-haplotype-joint/MOLM13_rep3_batch/chr4/bcbiotx/tmp_GnLP6' GenomicsDBImport --reader-threads 32 --genomicsdb-workspace-path MOLM13_rep3_batch-chr4_0_191154276_genomicsdb -L chr4:1-191154276 --variant /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/variation/rnaseq/gatk-haplotype/MOLM13_rep3-gatk-haplotype.vcf.gz
Using GATK jar /projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/share/gatk4-4.0.3.0-0/gatk-package-4.0.3.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms800m -Xmx94349m -XX:+UseSerialGC -Djava.io.tmpdir=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/joint/gatk-haplotype-joint/MOLM13_rep3_batch/chr4/bcbiotx/tmp_GnLP6 -jar /projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/share/gatk4-4.0.3.0-0/gatk-package-4.0.3.0-local.jar GenomicsDBImport --reader-threads 32 --genomicsdb-workspace-path MOLM13_rep3_batch-chr4_0_191154276_genomicsdb -L chr4:1-191154276 --variant /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/variation/rnaseq/gatk-haplotype/MOLM13_rep3-gatk-haplotype.vcf.gz
Here are the first few lines of the VCF:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT MOLM13_rep3
chr1 11869 . G <NON_REF> . . END=14470 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
chr1 14471 . C <NON_REF> . . END=14476 GT:DP:GQ:MIN_DP:PL 0/0:6:15:5:0,15,182
chr1 14477 . G <NON_REF> . . END=14477 GT:DP:GQ:MIN_DP:PL 0/0:9:0:9:0,0,280
It does indeed look like GVCF rather than VCF.
Any ideas? I'm not sure that this isn't a GATK issue rather than a bcbio issue.
@rdocking @roryk This is really a weird problem. I have reproduced this error for quite a few times a month ago. But yesterday I used a subset of the input fastq files to run the pipeline (head 100000 lines), it successfully finished! The output vcf.gz files looks right.
Maybe this is caused by bcbio, not GATK, is it right?
Thanks much for following up on this and for the detailed run information. I haven't been able to reproduce this so appreciate the help trying to determine the cases where it happens to we can push a fix to bcbio or GATK, depending on where the problem lies.
Rod, the workflow you describe and file types are exactly right. We first create gVCFs for each sample, then these get loaded in a GenomicsDB (the failing step), then get joint called.
The error messages indicate that this file:
/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/variation/rnaseq/gatk-haplotype/MOLM13_rep3-gatk-haplotype.vcf.gz
is missing the END
line in the header, which should look like:
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
Is that correct? I'm not sure why that would happen, but it would be helpful to confirm and then look into the logs to determine the HaplotypeCallerSpark command that generates that file. If we can reproduce this error with a minimal case outside of bcbio then could try to prepare a bug report for the GATK team. Alternatively, if there is something else going on we can try to track it down. Thanks for the help looking into this.
Hi Brad - thanks for looking into this. I re-ran this yesterday using the new split-by-chromosome logic, and got caught up at a different step (but it seems to be a similar root cause):
For the same sample, for chromosome 1:
Run GATK:
[2018-05-15T02:20Z] eccfx01n3: export SPARK_USER=rdocking && unset JAVA_HOME && export PATH=/projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/bin:$PATH && gatk-launch --java-options '-Xms800m -Xmx94349m -Djava.io.tmpdir=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmppIToLE'
HaplotypeCallerSpark --reference /projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19/ucsc/hg19.2bit
--annotation MappingQualityRankSumTest --annotation MappingQualityZero --annotation QualByDepth --annotation ReadPosRankSumTest --annotation RMSMappingQuality --annotation BaseQualityRankSumTest --annotation FisherStrand --annotation MappingQuality --annotation DepthPerAlleleBySample --annotation Coverage
-I /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/align/MOLM13_rep3/MOLM13_rep3-dedup.splitN.bam -L /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/variation/rnaseq/gatk-haplotype/regions/MOLM13_rep3-chr1-gatk-haplotype-regions-glimit-nolcr-nopolyx.bed
--interval-set-rule INTERSECTION --spark-master local[32]
--conf spark.local.dir=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmppIToLE
--annotation ClippingRankSumTest --annotation DepthPerSampleHC
--emit-ref-confidence GVCF -GQB 10 -GQB 20 -GQB 30 -GQB 40 -GQB 60 -GQB 80 --output /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmppIToLE/MOLM13_rep3-chr1-gatk-haplotype.vcf
Compress and index:
[2018-05-15T02:39Z] eccfx01n3: cat /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmppIToLE/MOLM13_rep3-chr1-gatk-haplotype.vcf | /projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/bin/bgzip --threads 32 -c > /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmpm7_R_E/MOLM13_rep3-chr1-gatk-haplotype.vcf.gz
[2018-05-15T02:39Z] eccfx01n3: /projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/bin/tabix -f -p vcf /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmpL3qNEh/MOLM13_rep3-chr1-gatk-haplotype.vcf.gz
FixHeader:
[2018-05-15T07:25Z] eccfx01n3: zgrep ^# /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/variation/rnaseq/gatk-haplotype/regions/MOLM13_rep3-chr1-gatk-haplotype.vcf.gz > /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmpOscSa1/MOLM13_rep3-chr1-gatk-haplotype-fixheader-header.vcf
[2018-05-15T07:25Z] eccfx01n3: unset JAVA_HOME && export PATH=/projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/bin:$PATH && picard FixVcfHeader HEADER=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmpOscSa1/MOLM13_rep3-chr1-gatk-haplotype-fixheader-header.vcf INPUT=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/variation/rnaseq/gatk-haplotype/regions/MOLM13_rep3-chr1-gatk-haplotype.vcf.gz OUTPUT=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/variation/rnaseq/gatk-haplotype/MOLM13_rep3-chr1-gatk-haplotype-fixheader.vcf.gz
This time, this step is where it crashes:
[1:apply]: CalledProcessError: Command 'set -o pipefail; unset JAVA_HOME && export PATH=/projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/bin:$PATH && picard FixVcfHeader HEADER=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmpOQ0QCk/MOLM13_rep1-chr1-gatk-haplotype-fixheader-header.vcf INPUT=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/variation/rnaseq/gatk-haplotype/regions/MOLM13_rep1-chr1-gatk-haplotype.vcf.gz OUTPUT=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/variation/rnaseq/gatk-haplotype/MOLM13_rep1-chr1-gatk-haplotype-fixheader.vcf.gz
23:44:43.965 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/share/picard-2.18.4-0/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Mon May 14 23:44:43 PDT 2018] FixVcfHeader INPUT=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/variation/rnaseq/gatk-haplotype/regions/MOLM13_rep1-chr1-gatk-haplotype.vcf.gz OUTPUT=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/variation/rnaseq/gatk-haplotype/MOLM13_rep1-chr1-gatk-haplotype-fixheader.vcf.gz HEADER=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmpOQ0QCk/MOLM13_rep1-chr1-gatk-haplotype-fixheader-header.vcf CHECK_FIRST_N_RECORDS=-1 ENFORCE_SAME_SAMPLES=true VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Mon May 14 23:44:43 PDT 2018] Executing as rdocking@eccfx01n4 on Linux 2.6.32-573.8.1.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_144-b01; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.4-SNAPSHOT
INFO 2018-05-14 23:44:44 FixVcfHeader FORMAT line found in HEADER will be added to OUTPUT: AD
INFO 2018-05-14 23:44:44 FixVcfHeader FORMAT line found in HEADER will be added to OUTPUT: GQ
INFO 2018-05-14 23:44:44 FixVcfHeader FORMAT line found in HEADER will be added to OUTPUT: GT
INFO 2018-05-14 23:44:44 FixVcfHeader FORMAT line found in HEADER will be added to OUTPUT: MMQ
INFO 2018-05-14 23:44:44 FixVcfHeader FORMAT line found in HEADER will be added to OUTPUT: PGT
INFO 2018-05-14 23:44:44 FixVcfHeader FORMAT line found in HEADER will be added to OUTPUT: PID
INFO 2018-05-14 23:44:44 FixVcfHeader FORMAT line found in HEADER will be added to OUTPUT: PL
INFO 2018-05-14 23:44:44 FixVcfHeader FORMAT line found in HEADER will be added to OUTPUT: SB
INFO 2018-05-14 23:44:44 FixVcfHeader Writing the output VCF.
[Mon May 14 23:44:44 PDT 2018] picard.vcf.FixVcfHeader done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=514850816
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.IllegalStateException: Key END found in VariantContext field INFO at chr1:11869 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCFheaders by default.
at htsjdk.variant.vcf.VCFEncoder.fieldIsMissingFromHeaderError(VCFEncoder.java:173)
at htsjdk.variant.vcf.VCFEncoder.encode(VCFEncoder.java:112)
at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:224)
at picard.vcf.FixVcfHeader.doWork(FixVcfHeader.java:194)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
' returned non-zero exit status 1
In this case, it looks like the intermediate files in bcbiotx
aren't there anymore, but the last VCF files that exist are in:
/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/variation/rnaseq/gatk-haplotype/regions/
Which look like, for each chromosome:
> ls -1 | grep 'chr1-' | grep 'rep1'
MOLM13_rep1-chr1-gatk-haplotype-regions.bed
MOLM13_rep1-chr1-gatk-haplotype-regions-glimit.bed
MOLM13_rep1-chr1-gatk-haplotype-regions-glimit-nolcr.bed
MOLM13_rep1-chr1-gatk-haplotype-regions-glimit-nolcr-nopolyx.bed
MOLM13_rep1-chr1-gatk-haplotype.vcf.gz
MOLM13_rep1-chr1-gatk-haplotype.vcf.gz.tbi
These are the INFO fields from those VCF files:
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=RAW_MQ,Number=1,Type=Float,Description="Raw data for RMS Mapping Quality">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
Let me know if there's other information I can provide - here's the complete commands log:
The debug log is ~33M - let me know if there's some part of that that would help.
Thanks again for all your help here!
OK - looks like this is actually something quite weird: the END
info tag is in most (but not all) of the intermediate VCF files:
[2018-05-15 12:14:10]{rdocking@clingen01}/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/variation/rnaseq/gatk-haplotype/regions/> find MOLM13_rep1-chr*.vcf.gz | xargs zgrep '^##INFO=<ID=END' {}
gzip: {}.gz: No such file or directory
MOLM13_rep1-chr10-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr11-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr12-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr13-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr14-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr15-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr16-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr17-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr18-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr19-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr20-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr21-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr22-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr4-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr5-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr6-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr7-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr8-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr9-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chrX-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
(i.e., it's missing from the chr1-chr3 VCF files, but there everywhere else)
Re-running the HaplotypeCaller command gives me a VCF with the correct ID=END
field in the header - perhaps this is something to do with the cluster jobs running out of memory.
Rod; Thanks for following up on this and all of the testing. To try and summarize: the region based approach is working through previous issues with runtimes, but we're getting some outputs with partial headers and missing END. If you re-run those with the same command, the headers are correct, indicating the problem is intermittent. Is that all correct?
If so, I think our next step is to get a correct and incorrect output for a region (with and without the END tag) and then inspect the outputs to look at differences. Ideally we'd do this with direct calls to gatk outside of bcbio. We could then summarize and prepare a bug report for the GATK team, as definitely we don't want to have non-deterministic results and would like to get to the root cause of this issue so we can detect and avoid. Would it be possible to generate those two outputs on your working dataset and share? If you have any tips about how best to generate the problem outputs I can also try to reproduce on my side so I can make these.
Thanks again for the help debugging.
@chapmanb @roryk
Hi, I rerun the HaplotypeCallerSpark command to regenerate the vcf files, but they still have no INFO=<ID=END
lines. I have seen that in the log file reported the warning message like
[2018-05-16T00:39Z] cnode02: 08:39:49.655 WARN HaplotypeCallerSpark - [2018-05-16T00:39Z] cnode02: Warning: HaplotypeCallerSpark is a BETA tool and is not yet ready for use in production [2018-05-16T00:39Z] cnode02: 08:39:49.655 INFO HaplotypeCallerSpark - Initializing engine [2018-05-16T00:39Z] cnode02: 08:39:49.656 INFO HaplotypeCallerSpark - Done initializing engine [2018-05-16T00:39Z] cnode02: 18/05/16 08:39:50 INFO SparkContext: Submitted application: HaplotypeCallerSpark [2018-05-16T00:39Z] cnode02: 08:39:59.224 INFO HaplotypeCallerSpark - **** [2018-05-16T00:39Z] cnode02: 08:39:59.224 INFO HaplotypeCallerSpark - The output of this tool DOES NOT match the output of HaplotypeCaller. [2018-05-16T00:39Z] cnode02: 08:39:59.224 INFO HaplotypeCallerSpark - It is under development and should not be used for production work. [2018-05-16T00:39Z] cnode02: 08:39:59.224 INFO HaplotypeCallerSpark - For evaluation only. [2018-05-16T00:39Z] cnode02: 08:39:59.224 INFO HaplotypeCallerSpark - Use the non-spark HaplotypeCaller if you care about the results. [2018-05-16T00:39Z] cnode02: 08:39:59.224 INFO HaplotypeCallerSpark - ****
So, I guess the beta version of HaplotypeCallerSpark caused this problem. Is there any way to use HaplotypeCaller rather than HaplotypeCallerSpark in bcbio?
Thanks for following up on this and all of the testing. To try and summarize: the region based approach is working through previous issues with runtimes, but we're getting some outputs with partial headers and missing END. If you re-run those with the same command, the headers are correct, indicating the problem is intermittent. Is that all correct?
Yes - that's what it looks like.
If so, I think our next step is to get a correct and incorrect output for a region (with and without the END tag) and then inspect the outputs to look at differences. Ideally we'd do this with direct calls to gatk outside of bcbio. We could then summarize and prepare a bug report for the GATK team, as definitely we don't want to have non-deterministic results and would like to get to the root cause of this issue so we can detect and avoid. Would it be possible to generate those two outputs on your working dataset and share? If you have any tips about how best to generate the problem outputs I can also try to reproduce on my side so I can make these.
Agreed - I will try to make/document a shareable examples as you suggest.
Rod - thanks so much for looking into this, I appreciate the help.
@hliu2016 -- if you use a single core bcbio will use GATK HaplotypeCaller, but it's quite slow which is why we've been working to incorporate the Spark based caller. The practical speeds of the single caller approach make it not very useful for reasonable sized datasets. Our validations of the Spark caller (on DNA variant calling, where we have truth sets) look great:
https://github.com/bcbio/bcbio_validations/tree/master/gatk4 https://github.com/bcbio/bcbio_validations/tree/master/deepvariant#deepvariant-v06-release-strelka2-stratification-and-initial-gatk-cnn
but clearly we have a few issues to work through in terms of stability which is what we're hoping to tackle here.
Thanks again for all the helpful discussion and feedback.
@chapmanb, This problem has been reproduced for quite a few times. Could I use gatk 3.8 to work through this problem?
@hliu2016 - you can use GATK 3.8 by using tools_off: gatk4
in your config:
For what it's worth, in my experience using GATK 3.8 within the RNA-Seq pipeline here will be very slow.
Updated testing results:
chr1
job), still gave the improper results when run directlyThis run is from a subset BAM file that I made to try to reproduce the issue more quickly. The BAM file is ~3G, and is shareable if need be. I'm attaching the cluster log output for the 60k job (that succeeeded) and the 70k job (that failed). I can't see anything notable, but I don't have a ton of experience with GATK.
gatk_debug_60k.txt gatk_debug_70k.txt
Here's the submission command for the jobs:
export SPARK_USER=rdocking && \
unset JAVA_HOME && \
export PATH=/projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/bin:$PATH && \
gatk-launch \
--java-options '-Xms800m -Xmx94349m -Djava.io.tmpdir=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicate_one_small/debug/' \
HaplotypeCallerSpark \
--reference /projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19/ucsc/hg19.2bit \
--annotation MappingQualityRankSumTest --annotation MappingQualityZero \
--annotation QualByDepth --annotation ReadPosRankSumTest \
--annotation RMSMappingQuality --annotation BaseQualityRankSumTest \
--annotation FisherStrand --annotation MappingQuality \
--annotation DepthPerAlleleBySample --annotation Coverage \
-I /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicate_one_small/work/align/MOLM13_rep1/MOLM13_rep1-dedup.splitN.bam \
-L /projects/karsanlab/rdocking/KARSANBIO-1254_pipeline/KARSANBIO-1390_rna_seq_runs/data/gatk_debug/chr1_70k.bed \
--interval-set-rule INTERSECTION \
--spark-master local[12] \
--conf spark.local.dir=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicate_one_small/debug \
--conf spark.driver.host=localhost \
--conf spark.network.timeout=800 \
--conf spark.executor.heartbeatInterval=100 \
--annotation ClippingRankSumTest --annotation DepthPerSampleHC \
--emit-ref-confidence GVCF -GQB 10 -GQB 20 -GQB 30 -GQB 40 -GQB 60 -GQB 80 \
--output MOLM13_rep1-chr1-70k-gatk-haplotype.vcf
Thanks!
One more note: I scaled the java memory request up a bit further for the failing job to:
--java-options '-Xms800m -Xmx110g
... and it produced a properly formed header. I'll see if I can get the equivalent memory request into my bcbio config and work past the issue that way.
@rdocking, Thanks very much for sharing your experience in this problem. I have also tested that more memory allocation is helpful to produce proper header in the vcf.gz files, but are not sure to what extent the Xmx parameter should be set for different size bam files. I look forward to a systematic solution in the next release of gatk.
Submitted an issue on the GATK side - https://github.com/broadinstitute/gatk/issues/4821
Not sure if there's much that can be done on the bcbio side to guard against this until it's fixed upstream.
Rod -- thanks so much for all this incredible debugging and for submitting the issue upstream to the GATK folks. This helps so much, and I'm 100% agreed that fixing in GATK will provide the most benefit to the most people.
As a temporary way to avoid the issue I've tried to improve the parallelization by chromosome to now split into smaller blocks. Based on your analysis, I hope this will avoid large blocks, result in less memory requirements, and remove the incorrect outputs. This is all implemented in the latest development version and you shouldn't need any changes/adjustment to your configuration.
If you have an opportunity to test this, please let us know if it helps. We could tweak splitting more aggressively if needed as I tried to guess some on what the best number of splits would be.
Thank you again for all the work and help on this.
Confirmed that this works with my setup - the jobs are split into smaller blocks, and all the VCF files were produced with propoer header.
Thanks a lot Brad for helping us work through this!
Rod -- nice one, thanks for the positive report. Glad it's now working for you and I'll close this for now. Hopefully it will also get fixed upstream in GATK to avoid any future issues but glad to have analyses going.
@chapmanb Thanks very much for your nice work. I further hope GATK fix this problem.
@chapmanb,
I have just tested the updated codes, it seems that there are still some bugs for GATK 4.0 in HaplotypeCallerSpark
step. I am running the pipeline on 12 rat samples, and the error is as below:
CalledProcessError: Command 'set -o pipefail; export SPARK_USER=root && unset JAVA_HOME && export PATH=/app/smartngs/bcbio/anaconda/bin:$PATH && gatk-launch -- java-options '-Xms500m -Xmx45864m -Djava.io.tmpdir=/app/datamart/jieyi_bio/sample/work/bcbiotx/tmpf1Mlmd' HaplotypeCallerSpark --reference /app/smartngs/bcbio/genomes/R norvegicus/rn6/ucsc/rn6.2bit --annotation MappingQualityRankSumTest --annotation MappingQualityZero --annotation QualByDepth --annotation ReadPosRankSumTest --annotatio n RMSMappingQuality --annotation BaseQualityRankSumTest --annotation FisherStrand --annotation MappingQuality --annotation DepthPerAlleleBySample --annotation Coverage -I /app/datamart/jieyi_bio/sample/work/align/3_21Y_rep1/3_21Y_rep1-dedup.splitN.bam -L /app/datamart/jieyi_bio/sample/work/variation/rnaseq/gatk-haplotype/regions/3_21Y _rep1-9_57996107_115555261-gatk-haplotype-regions.bed --interval-set-rule INTERSECTION --spark-master local[16] --conf spark.local.dir=/app/datamart/jieyi_bio/sample/wo rk/bcbiotx/tmpf1Mlmd --conf spark.driver.host=localhost --conf spark.network.timeout=800 --conf spark.executor.heartbeatInterval=100 --annotation ClippingRankSumTest -- annotation DepthPerSampleHC --emit-ref-confidence GVCF -GQB 10 -GQB 20 -GQB 30 -GQB 40 -GQB 60 -GQB 80 --output /app/datamart/jieyibio/sample/work/bcbiotx/tmpf1Mlmd/3 21Y_rep1-9_57996107_115555261-gatk-haplotype.vcf at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:30) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:134) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198) at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160) at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203) at org.broadinstitute.hellbender.Main.main(Main.java:289) Caused by: java.lang.ArrayIndexOutOfBoundsException: 4 at org.broadinstitute.hellbender.tools.walkers.genotyper.GenotypeLikelihoodCalculators.calculateGenotypeCountUsingTables(GenotypeLikelihoodCalculators.java:388) at org.broadinstitute.hellbender.tools.walkers.genotyper.GenotypeLikelihoodCalculators.getInstance(GenotypeLikelihoodCalculators.java:263) at org.broadinstitute.hellbender.tools.walkers.genotyper.AlleleSubsettingUtils.subsettedPLIndices(AlleleSubsettingUtils.java:279) at org.broadinstitute.hellbender.tools.walkers.genotyper.AlleleSubsettingUtils.subsetAlleles(AlleleSubsettingUtils.java:61) at org.broadinstitute.hellbender.tools.walkers.genotyper.GenotypingEngine.calculateGenotypes(GenotypingEngine.java:296) at org.broadinstitute.hellbender.tools.walkers.genotyper.GenotypingEngine.calculateGenotypes(GenotypingEngine.java:210) at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:15 0)
The root cause is "java.lang.ArrayIndexOutOfBoundsException: 4", I wonder there is any bugs in bcbio pipeline. Would you take a look at this?
Sorry about the issue. This is an non-deterministic error that I've also hit, and submitted a bug report: broadinstitute/gatk#4661 I haven't been able to provide a reproducible case because re-running the same command worked. So a couple of suggestions:
Sorry to not have a definitely solution but hope this helps some.
Hi,
I am also facing the same issue. I've tried re-running the command a few times by changing the number of threads from 10 to 20 to 1: bcbio_nextgen.py config.yaml -n 1
and also changing the java options but every time I realized that it resumes from where it had left off so it doesn't generate the vcf files again. How do I rerun the specific command where it had generated the vcf files?
Here is my config.yaml:
details:
- algorithm:
aligner: star
variantcaller: gatk-haplotype
# tools_off: gatk4
analysis: RNA-seq
description: 'Sample 1: CDL-634-13P'
files:
- /mnt/isilon/cbmi/variome/rathik/mendelian_rnaseq/fastq/CDL-634-13P/CDL-634-13P_pe_1.fq.gz
- /mnt/isilon/cbmi/variome/rathik/mendelian_rnaseq/fastq/CDL-634-13P/CDL-634-13P_pe_2.fq.gz
genome_build: GRCh37
metadata:
batch: CDL-634-13P
fc_name: CDL-634-13P-CDLS
upload:
dir: /mnt/isilon/cbmi/variome/rathik/mendelian_rnaseq/bcbio_out/final
resources:
gatk:
jvm_opts: ["-Xms2g", "-Xmx110g"]
UPDATE: Actually, I just deleted the variation folder in the current working directory and it is rerunning the HaplotypeCaller. Now, running with single core and aforementioned java options.
Thanks for the report and all of the work on debugging this. If you'd like to re-run specific commands, the best way is to look them up in log/bcbio-nextgen-commands.log
and then you can replicate outside of bcbio itself. If bcbio is picking up after some of the generated VCF files, that indicates it finished cleanly on them and should be working.
Practically, the jvm_opts
memory allocation you're providing for GATK looks like it might be problematic. That is the memory allocated per core, so you're providing up to 110 Gb of memory per core used, which will likely overrun available resources when run in parallel with multiple cores supplied. Hopefully scaling this back to the available memory per core on your machine will have better outcomes.
Thanks again for the work debugging this.
Just wanted to update: with -n 1
, all the headers were correctly formed. Thanks.
I am running RNA-seq pipeline on 12 rat samples. It seems GATK failed in the variant calling. The error information is as below: