java.lang.IllegalStateException: Key END found in ......

hliu2016 commented 6 years ago

I am running RNA-seq pipeline on 12 rat samples. It seems GATK failed in the variant calling. The error information is as below:

03:22:50.041 INFO GenomicsDBImport - Initializing engine 03:22:50.842 INFO IntervalArgumentCollection - Processing 282763074 bp from intervals 03:22:50.845 INFO GenomicsDBImport - Done initializing engine Created workspace /app/datamart/jieyi_bio/rat_star2/work/joint/gatk-haplotype-joint/1/1/1-1_0_282763074_genomicsdb 03:22:51.698 INFO GenomicsDBImport - Vid Map JSON file will be written to 1-1_0_282763074_genomicsdb/vidmap.json 03:22:51.698 INFO GenomicsDBImport - Callset Map JSON file will be written to 1-1_0_282763074_genomicsdb/callset.json 03:22:51.698 INFO GenomicsDBImport - Complete VCF Header will be written to 1-1_0_282763074_genomicsdb/vcfheader.vcf 03:22:51.698 INFO GenomicsDBImport - Importing to array - 1-1_0_282763074_genomicsdb/genomicsdb_array 03:22:51.710 INFO ProgressMeter - Starting traversal 03:22:51.711 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute 03:22:51.711 INFO GenomicsDBImport - Starting batch input file preload 03:22:51.831 INFO GenomicsDBImport - Finished batch preload 03:22:51.832 INFO GenomicsDBImport - Importing batch 1 with 6 samples 03:22:52.120 INFO GenomicsDBImport - Shutting down engine [April 19, 2018 3:22:52 AM CST] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.54 minutes. Runtime.totalMemory()=506986496 java.lang.IllegalStateException: Key END found in VariantContext field INFO at 1:401700 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default. at htsjdk.variant.vcf.VCFEncoder.fieldIsMissingFromHeaderError(VCFEncoder.java:173) at htsjdk.variant.vcf.VCFEncoder.encode(VCFEncoder.java:112) at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:224) at com.intel.genomicsdb.GenomicsDBImporter.add(GenomicsDBImporter.java:1368) at com.intel.genomicsdb.GenomicsDBImporter.importBatch(GenomicsDBImporter.java:1418) at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.traverse(GenomicsDBImport.java:509) at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:893) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:134) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198) at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160) at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203) at org.broadinstitute.hellbender.Main.main(Main.java:289) ' returned non-zero exit status 3

roryk commented 6 years ago

Thanks, sorry about that. Could you pass on the VCF file that has the issue so I can take a look at it?

roryk commented 6 years ago

Is this related to #2342 at all?

rdocking commented 6 years ago

Hi guys - I'm running into this issue now using the RNA-Seq pipeline (see related discussion on #2390).

Here's what my setup looks like:

details:
- algorithm:
    adapters:
    - truseq
    - illumina
    aligner: star
    exclude_regions:
    - polyx
    - lcr
    - highdepth
    - altcontigs
    expression_caller:
    - salmon
    mark_duplicates: true
    strandedness: firststrand
    transcriptome_align: false
    trim_reads: atropos
    variantcaller:
    - gatk-haplotype
  analysis: RNA-seq
  description: MOLM13_rep1
  files:
  - /projects/karsanlab/rdocking/KARSANBIO-1254_pipeline/KARSANBIO-1390_rna_seq_runs/data/A34048_1.fq.gz
  - /projects/karsanlab/rdocking/KARSANBIO-1254_pipeline/KARSANBIO-1390_rna_seq_runs/data/A34048_2.fq.gz
  genome_build: hg19
  metadata:
    batch: MOLM13_rep1_batch

(I don't usually run trimming, but I was trying it to see if it helped with the memory issues from the other ticket). In this case, the run passes through the first-stage of GATK variant calling, then crashes like this:

[May 12, 2018 1:37:16 AM PDT] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=811073536
java.lang.IllegalStateException: Key END found in VariantContext field INFO at chr4:48991 but this key isn't defined in the VCFHeader.  We require all VCFs to have complete VCF headers by default.
        at htsjdk.variant.vcf.VCFEncoder.fieldIsMissingFromHeaderError(VCFEncoder.java:173)
        at htsjdk.variant.vcf.VCFEncoder.encode(VCFEncoder.java:112)
        at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:224)
        at com.intel.genomicsdb.GenomicsDBImporter.add(GenomicsDBImporter.java:1368)
        at com.intel.genomicsdb.GenomicsDBImporter.importBatch(GenomicsDBImporter.java:1418)
        at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.traverse(GenomicsDBImport.java:509)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:893)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:134)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
[

The relevant commands run were:

SplitN:

[2018-05-11T23:00Z] eccfx01n1.hpc.bcgsc.ca: unset JAVA_HOME && export PATH=/projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/bin:$PATH && gatk-launch --java-options '-Xms727m -Xmx3272m -XX:+UseSerialGC -Djava.io.tmpdir=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmpC8Uo9D' SplitNCigarReads -R /projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19/seq/hg19.fa -I /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/align/MOLM13_rep3/MOLM13_rep3-dedup.bam --output /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmpJw9pVK/MOLM13_rep3-dedup.splitN.bam

HaplotypeCaller:

[2018-05-12T02:16Z] eccfx01n2: export SPARK_USER=rdocking && unset JAVA_HOME && export PATH=/projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/bin:$PATH && gatk-launch --java-options '-Xms800m -Xmx94349m -Djava.io.tmpdir=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmp617YSx' HaplotypeCallerSpark --reference /projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19/ucsc/hg19.2bit --annotation MappingQualityRankSumTest --annotation MappingQualityZero --annotation QualByDepth --annotation ReadPosRankSumTest --annotation RMSMappingQuality --annotation BaseQualityRankSumTest --annotation FisherStrand --annotation MappingQuality --annotation DepthPerAlleleBySample --annotation Coverage -I /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/align/MOLM13_rep2/MOLM13_rep2-dedup.splitN.bam -L /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bedprep/ref-transcripts-rnaseq_clean-glimit-nolcr-nopolyx.bed --interval-set-rule INTERSECTION --spark-master local[32] --conf spark.local.dir=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmp617YSx --annotation ClippingRankSumTest --annotation DepthPerSampleHC --emit-ref-confidence GVCF -GQB 10 -GQB 20 -GQB 30 -GQB 40 -GQB 60 -GQB 80 --output /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmp617YSx/MOLM13_rep2-gatk-haplotype.vcf

GenomicsDBImport:

[2018-05-12T08:43Z] eccfx01n3: unset JAVA_HOME && export PATH=/projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/bin:$PATH && gatk-launch --java-options '-Xms800m -Xmx94349m -XX:+UseSerialGC -Djava.io.tmpdir=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/joint/gatk-haplotype-joint/MOLM13_rep1_batch/chrM/bcbiotx/tmpbFjdnm' GenomicsDBImport --reader-threads 32 --genomicsdb-workspace-path MOLM13_rep1_batch-chrM_0_16571_genomicsdb -L chrM:1-16571 --variant /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/variation/rnaseq/gatk-haplotype/MOLM13_rep1-gatk-haplotype.vcf.gz

GenotypeGVCFs:

[2018-05-12T08:43Z] eccfx01n3: unset JAVA_HOME && export PATH=/projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/bin:$PATH && gatk-launch --java-options '-Xms800m -Xmx94349m -XX:+UseSerialGC -Djava.io.tmpdir=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmpjJJBx4' GenotypeGVCFs --variant gendb:///projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/joint/gatk-haplotype-joint/MOLM13_rep1_batch/chrM/MOLM13_rep1_batch-chrM_0_16571_genomicsdb -R /projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19/seq/hg19.fa --output /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmpoTr60e/MOLM13_rep1_batch-chrM_0_16571.vcf.gz -L chrM:1-16571 --use-new-qual-calculator

The actual command that errored out was this one:

[0:apply]: CalledProcessError: Command 'set -o pipefail; unset JAVA_HOME && export PATH=/projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/bin:$PATH && gatk-launch --java-options '-Xms800m -Xmx94349m -XX:+UseSerialGC -Djava.io.tmpdir=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/joint/gatk-haplotype-joint/MOLM13_rep3_batch/chr4/bcbiotx/tmp_GnLP6' GenomicsDBImport --reader-threads 32 --genomicsdb-workspace-path MOLM13_rep3_batch-chr4_0_191154276_genomicsdb -L chr4:1-191154276 --variant /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/variation/rnaseq/gatk-haplotype/MOLM13_rep3-gatk-haplotype.vcf.gz
Using GATK jar /projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/share/gatk4-4.0.3.0-0/gatk-package-4.0.3.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms800m -Xmx94349m -XX:+UseSerialGC -Djava.io.tmpdir=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/joint/gatk-haplotype-joint/MOLM13_rep3_batch/chr4/bcbiotx/tmp_GnLP6 -jar /projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/share/gatk4-4.0.3.0-0/gatk-package-4.0.3.0-local.jar GenomicsDBImport --reader-threads 32 --genomicsdb-workspace-path MOLM13_rep3_batch-chr4_0_191154276_genomicsdb -L chr4:1-191154276 --variant /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/variation/rnaseq/gatk-haplotype/MOLM13_rep3-gatk-haplotype.vcf.gz

Here are the first few lines of the VCF:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  MOLM13_rep3
chr1    11869   .       G       <NON_REF>       .       .       END=14470       GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
chr1    14471   .       C       <NON_REF>       .       .       END=14476       GT:DP:GQ:MIN_DP:PL      0/0:6:15:5:0,15,182
chr1    14477   .       G       <NON_REF>       .       .       END=14477       GT:DP:GQ:MIN_DP:PL      0/0:9:0:9:0,0,280

It does indeed look like GVCF rather than VCF.

Any ideas? I'm not sure that this isn't a GATK issue rather than a bcbio issue.

hliu2016 commented 6 years ago

@rdocking @roryk This is really a weird problem. I have reproduced this error for quite a few times a month ago. But yesterday I used a subset of the input fastq files to run the pipeline (head 100000 lines), it successfully finished! The output vcf.gz files looks right.

Maybe this is caused by bcbio, not GATK, is it right?

chapmanb commented 6 years ago

Thanks much for following up on this and for the detailed run information. I haven't been able to reproduce this so appreciate the help trying to determine the cases where it happens to we can push a fix to bcbio or GATK, depending on where the problem lies.

Rod, the workflow you describe and file types are exactly right. We first create gVCFs for each sample, then these get loaded in a GenomicsDB (the failing step), then get joint called.

The error messages indicate that this file:

/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/variation/rnaseq/gatk-haplotype/MOLM13_rep3-gatk-haplotype.vcf.gz

is missing the END line in the header, which should look like:

##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">

Is that correct? I'm not sure why that would happen, but it would be helpful to confirm and then look into the logs to determine the HaplotypeCallerSpark command that generates that file. If we can reproduce this error with a minimal case outside of bcbio then could try to prepare a bug report for the GATK team. Alternatively, if there is something else going on we can try to track it down. Thanks for the help looking into this.

rdocking commented 6 years ago

Hi Brad - thanks for looking into this. I re-ran this yesterday using the new split-by-chromosome logic, and got caught up at a different step (but it seems to be a similar root cause):

For the same sample, for chromosome 1:

Run GATK:

[2018-05-15T02:20Z] eccfx01n3: export SPARK_USER=rdocking && unset JAVA_HOME && export PATH=/projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/bin:$PATH && gatk-launch --java-options '-Xms800m -Xmx94349m -Djava.io.tmpdir=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmppIToLE' 
HaplotypeCallerSpark --reference /projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19/ucsc/hg19.2bit 
--annotation MappingQualityRankSumTest --annotation MappingQualityZero --annotation QualByDepth --annotation ReadPosRankSumTest --annotation RMSMappingQuality --annotation BaseQualityRankSumTest --annotation FisherStrand --annotation MappingQuality --annotation DepthPerAlleleBySample --annotation Coverage 
-I /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/align/MOLM13_rep3/MOLM13_rep3-dedup.splitN.bam -L /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/variation/rnaseq/gatk-haplotype/regions/MOLM13_rep3-chr1-gatk-haplotype-regions-glimit-nolcr-nopolyx.bed 
--interval-set-rule INTERSECTION --spark-master local[32] 
--conf spark.local.dir=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmppIToLE 
--annotation ClippingRankSumTest --annotation DepthPerSampleHC 
--emit-ref-confidence GVCF -GQB 10 -GQB 20 -GQB 30 -GQB 40 -GQB 60 -GQB 80 --output /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmppIToLE/MOLM13_rep3-chr1-gatk-haplotype.vcf

Compress and index:

[2018-05-15T02:39Z] eccfx01n3: cat /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmppIToLE/MOLM13_rep3-chr1-gatk-haplotype.vcf  | /projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/bin/bgzip --threads 32 -c > /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmpm7_R_E/MOLM13_rep3-chr1-gatk-haplotype.vcf.gz
[2018-05-15T02:39Z] eccfx01n3: /projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/bin/tabix -f -p vcf /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmpL3qNEh/MOLM13_rep3-chr1-gatk-haplotype.vcf.gz

FixHeader:

[2018-05-15T07:25Z] eccfx01n3: zgrep ^# /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/variation/rnaseq/gatk-haplotype/regions/MOLM13_rep3-chr1-gatk-haplotype.vcf.gz > /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmpOscSa1/MOLM13_rep3-chr1-gatk-haplotype-fixheader-header.vcf
[2018-05-15T07:25Z] eccfx01n3: unset JAVA_HOME && export PATH=/projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/bin:$PATH && picard FixVcfHeader HEADER=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmpOscSa1/MOLM13_rep3-chr1-gatk-haplotype-fixheader-header.vcf INPUT=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/variation/rnaseq/gatk-haplotype/regions/MOLM13_rep3-chr1-gatk-haplotype.vcf.gz OUTPUT=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/variation/rnaseq/gatk-haplotype/MOLM13_rep3-chr1-gatk-haplotype-fixheader.vcf.gz

This time, this step is where it crashes:

[1:apply]: CalledProcessError: Command 'set -o pipefail; unset JAVA_HOME && export PATH=/projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/bin:$PATH && picard FixVcfHeader HEADER=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmpOQ0QCk/MOLM13_rep1-chr1-gatk-haplotype-fixheader-header.vcf INPUT=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/variation/rnaseq/gatk-haplotype/regions/MOLM13_rep1-chr1-gatk-haplotype.vcf.gz OUTPUT=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/variation/rnaseq/gatk-haplotype/MOLM13_rep1-chr1-gatk-haplotype-fixheader.vcf.gz
23:44:43.965 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/share/picard-2.18.4-0/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Mon May 14 23:44:43 PDT 2018] FixVcfHeader INPUT=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/variation/rnaseq/gatk-haplotype/regions/MOLM13_rep1-chr1-gatk-haplotype.vcf.gz OUTPUT=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/variation/rnaseq/gatk-haplotype/MOLM13_rep1-chr1-gatk-haplotype-fixheader.vcf.gz HEADER=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/bcbiotx/tmpOQ0QCk/MOLM13_rep1-chr1-gatk-haplotype-fixheader-header.vcf    CHECK_FIRST_N_RECORDS=-1 ENFORCE_SAME_SAMPLES=true VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Mon May 14 23:44:43 PDT 2018] Executing as rdocking@eccfx01n4 on Linux 2.6.32-573.8.1.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_144-b01; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.4-SNAPSHOT
INFO    2018-05-14 23:44:44     FixVcfHeader    FORMAT line found in HEADER will be added to OUTPUT: AD
INFO    2018-05-14 23:44:44     FixVcfHeader    FORMAT line found in HEADER will be added to OUTPUT: GQ
INFO    2018-05-14 23:44:44     FixVcfHeader    FORMAT line found in HEADER will be added to OUTPUT: GT
INFO    2018-05-14 23:44:44     FixVcfHeader    FORMAT line found in HEADER will be added to OUTPUT: MMQ
INFO    2018-05-14 23:44:44     FixVcfHeader    FORMAT line found in HEADER will be added to OUTPUT: PGT
INFO    2018-05-14 23:44:44     FixVcfHeader    FORMAT line found in HEADER will be added to OUTPUT: PID
INFO    2018-05-14 23:44:44     FixVcfHeader    FORMAT line found in HEADER will be added to OUTPUT: PL
INFO    2018-05-14 23:44:44     FixVcfHeader    FORMAT line found in HEADER will be added to OUTPUT: SB
INFO    2018-05-14 23:44:44     FixVcfHeader    Writing the output VCF.
[Mon May 14 23:44:44 PDT 2018] picard.vcf.FixVcfHeader done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=514850816
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.IllegalStateException: Key END found in VariantContext field INFO at chr1:11869 but this key isn't defined in the VCFHeader.  We require all VCFs to have complete VCFheaders by default.
        at htsjdk.variant.vcf.VCFEncoder.fieldIsMissingFromHeaderError(VCFEncoder.java:173)
        at htsjdk.variant.vcf.VCFEncoder.encode(VCFEncoder.java:112)
        at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:224)
        at picard.vcf.FixVcfHeader.doWork(FixVcfHeader.java:194)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
        at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
' returned non-zero exit status 1

In this case, it looks like the intermediate files in bcbiotx aren't there anymore, but the last VCF files that exist are in:

/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/variation/rnaseq/gatk-haplotype/regions/

Which look like, for each chromosome:

> ls -1 | grep 'chr1-' | grep 'rep1'
MOLM13_rep1-chr1-gatk-haplotype-regions.bed
MOLM13_rep1-chr1-gatk-haplotype-regions-glimit.bed
MOLM13_rep1-chr1-gatk-haplotype-regions-glimit-nolcr.bed
MOLM13_rep1-chr1-gatk-haplotype-regions-glimit-nolcr-nopolyx.bed
MOLM13_rep1-chr1-gatk-haplotype.vcf.gz
MOLM13_rep1-chr1-gatk-haplotype.vcf.gz.tbi

These are the INFO fields from those VCF files:

##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=RAW_MQ,Number=1,Type=Float,Description="Raw data for RMS Mapping Quality">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">

Let me know if there's other information I can provide - here's the complete commands log:

bcbio-nextgen-commands.log

The debug log is ~33M - let me know if there's some part of that that would help.

Thanks again for all your help here!

rdocking commented 6 years ago

OK - looks like this is actually something quite weird: the END info tag is in most (but not all) of the intermediate VCF files:

[2018-05-15 12:14:10]{rdocking@clingen01}/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicates/work/variation/rnaseq/gatk-haplotype/regions/> find MOLM13_rep1-chr*.vcf.gz | xargs zgrep '^##INFO=<ID=END' {}
gzip: {}.gz: No such file or directory
MOLM13_rep1-chr10-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr11-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr12-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr13-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr14-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr15-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr16-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr17-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr18-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr19-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr20-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr21-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr22-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr4-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr5-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr6-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr7-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr8-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chr9-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
MOLM13_rep1-chrX-gatk-haplotype.vcf.gz:##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">

(i.e., it's missing from the chr1-chr3 VCF files, but there everywhere else)

rdocking commented 6 years ago

Re-running the HaplotypeCaller command gives me a VCF with the correct ID=END field in the header - perhaps this is something to do with the cluster jobs running out of memory.

chapmanb commented 6 years ago

Rod; Thanks for following up on this and all of the testing. To try and summarize: the region based approach is working through previous issues with runtimes, but we're getting some outputs with partial headers and missing END. If you re-run those with the same command, the headers are correct, indicating the problem is intermittent. Is that all correct?

If so, I think our next step is to get a correct and incorrect output for a region (with and without the END tag) and then inspect the outputs to look at differences. Ideally we'd do this with direct calls to gatk outside of bcbio. We could then summarize and prepare a bug report for the GATK team, as definitely we don't want to have non-deterministic results and would like to get to the root cause of this issue so we can detect and avoid. Would it be possible to generate those two outputs on your working dataset and share? If you have any tips about how best to generate the problem outputs I can also try to reproduce on my side so I can make these.

Thanks again for the help debugging.

hliu2016 commented 6 years ago

@chapmanb @roryk Hi, I rerun the HaplotypeCallerSpark command to regenerate the vcf files, but they still have no INFO=<ID=END lines. I have seen that in the log file reported the warning message like

[2018-05-16T00:39Z] cnode02: 08:39:49.655 WARN HaplotypeCallerSpark - [2018-05-16T00:39Z] cnode02: Warning: HaplotypeCallerSpark is a BETA tool and is not yet ready for use in production [2018-05-16T00:39Z] cnode02: 08:39:49.655 INFO HaplotypeCallerSpark - Initializing engine [2018-05-16T00:39Z] cnode02: 08:39:49.656 INFO HaplotypeCallerSpark - Done initializing engine [2018-05-16T00:39Z] cnode02: 18/05/16 08:39:50 INFO SparkContext: Submitted application: HaplotypeCallerSpark [2018-05-16T00:39Z] cnode02: 08:39:59.224 INFO HaplotypeCallerSpark - **** [2018-05-16T00:39Z] cnode02: 08:39:59.224 INFO HaplotypeCallerSpark - The output of this tool DOES NOT match the output of HaplotypeCaller. [2018-05-16T00:39Z] cnode02: 08:39:59.224 INFO HaplotypeCallerSpark - It is under development and should not be used for production work. [2018-05-16T00:39Z] cnode02: 08:39:59.224 INFO HaplotypeCallerSpark - For evaluation only. [2018-05-16T00:39Z] cnode02: 08:39:59.224 INFO HaplotypeCallerSpark - Use the non-spark HaplotypeCaller if you care about the results. [2018-05-16T00:39Z] cnode02: 08:39:59.224 INFO HaplotypeCallerSpark - ****

So, I guess the beta version of HaplotypeCallerSpark caused this problem. Is there any way to use HaplotypeCaller rather than HaplotypeCallerSpark in bcbio?

rdocking commented 6 years ago

Thanks for following up on this and all of the testing. To try and summarize: the region based approach is working through previous issues with runtimes, but we're getting some outputs with partial headers and missing END. If you re-run those with the same command, the headers are correct, indicating the problem is intermittent. Is that all correct?

Yes - that's what it looks like.

If so, I think our next step is to get a correct and incorrect output for a region (with and without the END tag) and then inspect the outputs to look at differences. Ideally we'd do this with direct calls to gatk outside of bcbio. We could then summarize and prepare a bug report for the GATK team, as definitely we don't want to have non-deterministic results and would like to get to the root cause of this issue so we can detect and avoid. Would it be possible to generate those two outputs on your working dataset and share? If you have any tips about how best to generate the problem outputs I can also try to reproduce on my side so I can make these.

Agreed - I will try to make/document a shareable examples as you suggest.

chapmanb commented 6 years ago

Rod - thanks so much for looking into this, I appreciate the help.

@hliu2016 -- if you use a single core bcbio will use GATK HaplotypeCaller, but it's quite slow which is why we've been working to incorporate the Spark based caller. The practical speeds of the single caller approach make it not very useful for reasonable sized datasets. Our validations of the Spark caller (on DNA variant calling, where we have truth sets) look great:

https://github.com/bcbio/bcbio_validations/tree/master/gatk4 https://github.com/bcbio/bcbio_validations/tree/master/deepvariant#deepvariant-v06-release-strelka2-stratification-and-initial-gatk-cnn

but clearly we have a few issues to work through in terms of stability which is what we're hoping to tackle here.

Thanks again for all the helpful discussion and feedback.

hliu2016 commented 6 years ago

@chapmanb, This problem has been reproduced for quite a few times. Could I use gatk 3.8 to work through this problem?

rdocking commented 6 years ago

@hliu2016 - you can use GATK 3.8 by using tools_off: gatk4 in your config:

http://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html?highlight=tools_off#changing-bcbio-defaults

For what it's worth, in my experience using GATK 3.8 within the RNA-Seq pipeline here will be very slow.

rdocking commented 6 years ago

Updated testing results:

I extracted the HaplotypeCaller commands from the bcbio logs, and ran them directly in the same cluster environment.
I confirmed that the original command that was failing (the chr1 job), still gave the improper results when run directly
I confirmed that this issue is still present in GATK 4.0.4.0
I confirmed that the issue seems to relate to the amount of sequence being queried by subsetting the input BED file:
- 20k intervals, ~31Mb: works
- 30k intervals, ~51Mb: works
- 40k intervals, ~71Mb: works
- 50k intervals, ~89Mb: works
- 60k intervals, ~109Mb: works
- 70k intervals (all chr1), ~125Mb: fails
This seems to be related to the amount of sequence, not the number of intervals. Using the input BED file without removal of the low-complexity regions and polyX regions also fails:
- 3,252 intervals, ~128Mb: fails.

This run is from a subset BAM file that I made to try to reproduce the issue more quickly. The BAM file is ~3G, and is shareable if need be. I'm attaching the cluster log output for the 60k job (that succeeeded) and the 70k job (that failed). I can't see anything notable, but I don't have a ton of experience with GATK.

gatk_debug_60k.txt gatk_debug_70k.txt

Here's the submission command for the jobs:

export SPARK_USER=rdocking && \
unset JAVA_HOME && \
export PATH=/projects/rdocking_prj/software/bcbio-nextgen/data/anaconda/bin:$PATH && \
gatk-launch \
--java-options '-Xms800m -Xmx94349m -Djava.io.tmpdir=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicate_one_small/debug/' \
HaplotypeCallerSpark \
--reference /projects/rdocking_prj/software/bcbio-nextgen/data/genomes/Hsapiens/hg19/ucsc/hg19.2bit \
--annotation MappingQualityRankSumTest --annotation MappingQualityZero \
--annotation QualByDepth --annotation ReadPosRankSumTest \
--annotation RMSMappingQuality --annotation BaseQualityRankSumTest \
--annotation FisherStrand --annotation MappingQuality \
--annotation DepthPerAlleleBySample --annotation Coverage \
-I /projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicate_one_small/work/align/MOLM13_rep1/MOLM13_rep1-dedup.splitN.bam \
-L /projects/karsanlab/rdocking/KARSANBIO-1254_pipeline/KARSANBIO-1390_rna_seq_runs/data/gatk_debug/chr1_70k.bed \
--interval-set-rule INTERSECTION \
--spark-master local[12] \
--conf spark.local.dir=/projects/karsanscratch/rdocking/KARSANBIO-1390_rna_seq_runs/molm13_replicate_one_small/debug \
--conf spark.driver.host=localhost \
--conf spark.network.timeout=800 \
--conf spark.executor.heartbeatInterval=100 \
--annotation ClippingRankSumTest --annotation DepthPerSampleHC \
--emit-ref-confidence GVCF -GQB 10 -GQB 20 -GQB 30 -GQB 40 -GQB 60 -GQB 80 \
--output MOLM13_rep1-chr1-70k-gatk-haplotype.vcf

Thanks!

rdocking commented 6 years ago

One more note: I scaled the java memory request up a bit further for the failing job to:

--java-options '-Xms800m -Xmx110g

... and it produced a properly formed header. I'll see if I can get the equivalent memory request into my bcbio config and work past the issue that way.

hliu2016 commented 6 years ago

@rdocking, Thanks very much for sharing your experience in this problem. I have also tested that more memory allocation is helpful to produce proper header in the vcf.gz files, but are not sure to what extent the Xmx parameter should be set for different size bam files. I look forward to a systematic solution in the next release of gatk.

rdocking commented 6 years ago

Submitted an issue on the GATK side - https://github.com/broadinstitute/gatk/issues/4821

Not sure if there's much that can be done on the bcbio side to guard against this until it's fixed upstream.

chapmanb commented 6 years ago

Rod -- thanks so much for all this incredible debugging and for submitting the issue upstream to the GATK folks. This helps so much, and I'm 100% agreed that fixing in GATK will provide the most benefit to the most people.

As a temporary way to avoid the issue I've tried to improve the parallelization by chromosome to now split into smaller blocks. Based on your analysis, I hope this will avoid large blocks, result in less memory requirements, and remove the incorrect outputs. This is all implemented in the latest development version and you shouldn't need any changes/adjustment to your configuration.

If you have an opportunity to test this, please let us know if it helps. We could tweak splitting more aggressively if needed as I tried to guess some on what the best number of splits would be.

Thank you again for all the work and help on this.

rdocking commented 6 years ago

Confirmed that this works with my setup - the jobs are split into smaller blocks, and all the VCF files were produced with propoer header.

Thanks a lot Brad for helping us work through this!

chapmanb commented 6 years ago

Rod -- nice one, thanks for the positive report. Glad it's now working for you and I'll close this for now. Hopefully it will also get fixed upstream in GATK to avoid any future issues but glad to have analyses going.

hliu2016 commented 6 years ago

@chapmanb Thanks very much for your nice work. I further hope GATK fix this problem.

hliu2016 commented 6 years ago

@chapmanb, I have just tested the updated codes, it seems that there are still some bugs for GATK 4.0 in HaplotypeCallerSpark step. I am running the pipeline on 12 rat samples, and the error is as below:

CalledProcessError: Command 'set -o pipefail; export SPARK_USER=root && unset JAVA_HOME && export PATH=/app/smartngs/bcbio/anaconda/bin:$PATH && gatk-launch -- java-options '-Xms500m -Xmx45864m -Djava.io.tmpdir=/app/datamart/jieyi_bio/sample/work/bcbiotx/tmpf1Mlmd' HaplotypeCallerSpark --reference /app/smartngs/bcbio/genomes/R norvegicus/rn6/ucsc/rn6.2bit --annotation MappingQualityRankSumTest --annotation MappingQualityZero --annotation QualByDepth --annotation ReadPosRankSumTest --annotatio n RMSMappingQuality --annotation BaseQualityRankSumTest --annotation FisherStrand --annotation MappingQuality --annotation DepthPerAlleleBySample --annotation Coverage -I /app/datamart/jieyi_bio/sample/work/align/3_21Y_rep1/3_21Y_rep1-dedup.splitN.bam -L /app/datamart/jieyi_bio/sample/work/variation/rnaseq/gatk-haplotype/regions/3_21Y _rep1-9_57996107_115555261-gatk-haplotype-regions.bed --interval-set-rule INTERSECTION --spark-master local[16] --conf spark.local.dir=/app/datamart/jieyi_bio/sample/wo rk/bcbiotx/tmpf1Mlmd --conf spark.driver.host=localhost --conf spark.network.timeout=800 --conf spark.executor.heartbeatInterval=100 --annotation ClippingRankSumTest -- annotation DepthPerSampleHC --emit-ref-confidence GVCF -GQB 10 -GQB 20 -GQB 30 -GQB 40 -GQB 60 -GQB 80 --output /app/datamart/jieyibio/sample/work/bcbiotx/tmpf1Mlmd/3 21Y_rep1-9_57996107_115555261-gatk-haplotype.vcf at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:30) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:134) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198) at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160) at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203) at org.broadinstitute.hellbender.Main.main(Main.java:289) Caused by: java.lang.ArrayIndexOutOfBoundsException: 4 at org.broadinstitute.hellbender.tools.walkers.genotyper.GenotypeLikelihoodCalculators.calculateGenotypeCountUsingTables(GenotypeLikelihoodCalculators.java:388) at org.broadinstitute.hellbender.tools.walkers.genotyper.GenotypeLikelihoodCalculators.getInstance(GenotypeLikelihoodCalculators.java:263) at org.broadinstitute.hellbender.tools.walkers.genotyper.AlleleSubsettingUtils.subsettedPLIndices(AlleleSubsettingUtils.java:279) at org.broadinstitute.hellbender.tools.walkers.genotyper.AlleleSubsettingUtils.subsetAlleles(AlleleSubsettingUtils.java:61) at org.broadinstitute.hellbender.tools.walkers.genotyper.GenotypingEngine.calculateGenotypes(GenotypingEngine.java:296) at org.broadinstitute.hellbender.tools.walkers.genotyper.GenotypingEngine.calculateGenotypes(GenotypingEngine.java:210) at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:15 0)

The root cause is "java.lang.ArrayIndexOutOfBoundsException: 4", I wonder there is any bugs in bcbio pipeline. Would you take a look at this?

chapmanb commented 6 years ago

Sorry about the issue. This is an non-deterministic error that I've also hit, and submitted a bug report: broadinstitute/gatk#4661 I haven't been able to provide a reproducible case because re-running the same command worked. So a couple of suggestions:

Try re-running in place to see if a re-run works on your sample.
Respond to that GATK issue to let them know you're also seeing the problem, just so they know it's affecting multiple people. Maybe someone will be able to give us ideas to work around it.

Sorry to not have a definitely solution but hope this helps some.

komalsrathi commented 6 years ago

Hi,

I am also facing the same issue. I've tried re-running the command a few times by changing the number of threads from 10 to 20 to 1: bcbio_nextgen.py config.yaml -n 1 and also changing the java options but every time I realized that it resumes from where it had left off so it doesn't generate the vcf files again. How do I rerun the specific command where it had generated the vcf files?

Here is my config.yaml:

details:
- algorithm:
    aligner: star
    variantcaller: gatk-haplotype
    # tools_off: gatk4
  analysis: RNA-seq
  description: 'Sample 1: CDL-634-13P'
  files:
  - /mnt/isilon/cbmi/variome/rathik/mendelian_rnaseq/fastq/CDL-634-13P/CDL-634-13P_pe_1.fq.gz
  - /mnt/isilon/cbmi/variome/rathik/mendelian_rnaseq/fastq/CDL-634-13P/CDL-634-13P_pe_2.fq.gz
  genome_build: GRCh37
  metadata:
    batch: CDL-634-13P
fc_name: CDL-634-13P-CDLS
upload:
  dir: /mnt/isilon/cbmi/variome/rathik/mendelian_rnaseq/bcbio_out/final
resources:
  gatk:
    jvm_opts: ["-Xms2g", "-Xmx110g"]

UPDATE: Actually, I just deleted the variation folder in the current working directory and it is rerunning the HaplotypeCaller. Now, running with single core and aforementioned java options.

chapmanb commented 6 years ago

Thanks for the report and all of the work on debugging this. If you'd like to re-run specific commands, the best way is to look them up in log/bcbio-nextgen-commands.log and then you can replicate outside of bcbio itself. If bcbio is picking up after some of the generated VCF files, that indicates it finished cleanly on them and should be working.

Practically, the jvm_opts memory allocation you're providing for GATK looks like it might be problematic. That is the memory allocated per core, so you're providing up to 110 Gb of memory per core used, which will likely overrun available resources when run in parallel with multiple cores supplied. Hopefully scaling this back to the available memory per core on your machine will have better outcomes.

Thanks again for the work debugging this.

komalsrathi commented 6 years ago

Just wanted to update: with -n 1, all the headers were correctly formed. Thanks.

bcbio / bcbio-nextgen

java.lang.IllegalStateException: Key END found in ...... #2375