`HaplotypeCaller` in linked DeBruijn mode + pileup mode explodes with `java.lang.NegativeArraySizeException` or exit code `247` (OOM)

jonn-smith commented 1 year ago

I'm working on some Plasmodium falciparum callsets in GATK and I have come across a curious error:

Using GATK wrapper script /juffowup/gatk/build/install/gatk/bin/gatk
Running:
    /juffowup/gatk/build/install/gatk/bin/gatk HaplotypeCaller -R /juffowup2/malaria/references/PlasmoDB-61_Pfalciparum3D7_Genome.fasta -I /juffowup2/malaria/haplotypecaller_arg_testing/fixed_bam/PG0004-CW.aligned.merged.markDuplicates.sorted.BQSR.bam -O /juffowup2/malaria/haplotypecaller_arg_testing/PG0004-CW.haplotype_caller.fixed_bam_file.with_pileup.g.vcf.gz --bam-output /juffowup2/malaria/haplotypecaller_arg_testing/PG0004-CW.haplotype_caller.fixed_bam_file.with_pileup.bamout.bam -contamination 0 --sample-ploidy 2 --linked-de-bruijn-graph --pileup-detection true --pileup-detection-enable-indel-pileup-calling true --max-reads-per-alignment-start 20 --annotate-with-num-discovered-alleles -GQB 10 -GQB 20 -GQB 30 -GQB 40 -GQB 50 -GQB 60 -GQB 70 -GQB 80 -GQB 90 -G StandardAnnotation -G StandardHCAnnotation -ERC GVCF --verbosity INFO
14:14:15.323 WARN  GATKAnnotationPluginDescriptor - Redundant enabled annotation group (StandardAnnotation) is enabled for this tool by default
14:14:15.328 WARN  GATKAnnotationPluginDescriptor - Redundant enabled annotation group (StandardHCAnnotation) is enabled for this tool by default
14:14:15.388 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/juffowup/gatk/build/install/gatk/lib/gkl-0.8.11.jar!/com/intel/gkl/native/libgkl_compression.so
14:14:15.435 INFO  HaplotypeCaller - ------------------------------------------------------------
14:14:15.439 INFO  HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.4.0.0-44-g1529aa1-SNAPSHOT
14:14:15.439 INFO  HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
14:14:15.439 INFO  HaplotypeCaller - Executing as jonn@dsde-methods-jonn-juffowup on Linux v5.4.0-1104-gcp amd64
14:14:15.439 INFO  HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v17.0.7+7
14:14:15.440 INFO  HaplotypeCaller - Start Date/Time: July 26, 2023 at 2:14:15 PM UTC
...
22:15:34.977 INFO  HaplotypeCaller - Shutting down engine
[July 26, 2023 at 10:15:34 PM UTC] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 481.33 minutes.
Runtime.totalMemory()=47982837760
java.lang.NegativeArraySizeException: -896617256
        at org.broadinstitute.hellbender.utils.pairhmm.VectorLoglessPairHMM.computeLog10Likelihoods(VectorLoglessPairHMM.java:131)
        at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.PairHMMLikelihoodCalculationEngine.computeReadLikelihoods(PairHMMLikelihoodCalculationEngine.java:272)
        at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.PairHMMLikelihoodCalculationEngine.computeReadLikelihoods(PairHMMLikelihoodCalculationEngine.java:197)
        at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.PairHMMLikelihoodCalculationEngine.computeReadLikelihoods(PairHMMLikelihoodCalculationEngine.java:177)
        at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCallerEngine.callRegion(HaplotypeCallerEngine.java:790)
        at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller.apply(HaplotypeCaller.java:269)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:200)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:173)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1098)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:149)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:198)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:217)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:166)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:209)
        at org.broadinstitute.hellbender.Main.main(Main.java:306)

real    481m24.418s
user    581m54.752s
sys     2m49.965s

This run did not complete successfully - the Exception caused it to fail prematurely.

Previously I had seen HaplotypeCaller run out of memory and fail in almost as much time, so I think this and the OOM error are related. The only difference in invocation was that with the OOM failure, I was running with the default for --max-reads-per-alignment-start (50). This also works just fine with that setting at 15. The failure seems to occur around the same place in the data each time (the end of chr13). At that point in the data, there is a very large pileup which is probably instigating this. Additionally, if I remove the --linked-de-bruijn-graph argument, this runs just fine with the default setting of --max-reads-per-alignment-start. I have a minimally reproductive dataset that I can share which reproduces the OOM error for sure (I'm 99% sure it reproduces this one as well).

For the OOM failures, the final logs from HaplotypeCaller look like this:

./gatk HaplotypeCaller ...
...
15:56:23.205 INFO  ProgressMeter -  Pf3D7_13_v3:2603234            100.5                114070           1134.5
15:56:33.443 INFO  ProgressMeter -  Pf3D7_13_v3:2661462            100.7                114420           1136.1
Dangling End recovery killed because of a loop (getReferencePathForwardFromKmer)
15:56:43.998 INFO  ProgressMeter -  Pf3D7_13_v3:2730055            100.9                114840           1138.3
15:56:59.911 INFO  ProgressMeter -  Pf3D7_13_v3:2798281            101.2                115210           1139.0
15:59:27.062 INFO  ProgressMeter -  Pf3D7_13_v3:2861780            103.6                115460           1114.4
Dangling End recovery killed because of a loop (getReferencePathForwardFromKmer)
Dangling End recovery killed because of a loop (getReferencePathForwardFromKmer)
15:59:37.457 INFO  ProgressMeter -  Pf3D7_13_v3:2869697            103.8                115500           1112.9

real    671m24.770s
user    777m30.923s
sys     6m13.682s

$ echo $?
247

Here is my command-line invocation:

./gatk --java-options "-Xmx100000m -Xms25000m" \
  HaplotypeCaller \
    -R /juffowup2/malaria/references/PlasmoDB-61_Pfalciparum3D7_Genome.fasta \
    -I ${WORKING_DIR}/fixed_bam/PG0004-CW.aligned.merged.markDuplicates.sorted.BQSR.bam \
    -O ${WORKING_DIR}/PG0004-CW.haplotype_caller.fixed_bam_file.with_pileup.g.vcf.gz \
    --bam-output ${WORKING_DIR}/PG0004-CW.haplotype_caller.fixed_bam_file.with_pileup.bamout.bam \
    -contamination 0 \
    --sample-ploidy 2 \
    --linked-de-bruijn-graph \
    --pileup-detection true \
    --pileup-detection-enable-indel-pileup-calling true \
    --max-reads-per-alignment-start 20 \
    --annotate-with-num-discovered-alleles \
    -GQB 10 -GQB 20 -GQB 30 -GQB 40 -GQB 50 -GQB 60 -GQB 70 -GQB 80 -GQB 90 \
    -G StandardAnnotation -G StandardHCAnnotation  \
    -ERC GVCF \
    --verbosity INFO \

jamesemery commented 1 year ago

well looks like the offending line is this: mLogLikelihoodArray = new double[readListSize * numHaplotypes]; //to store results

I don't like that those two numbers are overflowing... numHaplotypes should really be getting bounded to ~256 and readListSize similarly should be capped by the downsampling. This warrants investigation

droazen commented 1 year ago

@jamesemery @jonn-smith So the product of the number of reads and the number of haplotypes is exceeding Integer.MAX_VALUE (2147483647). If we assume that the number of haplotypes is bounded at 256, then this implies at least ~8,388,607 reads in the region. But --max-assembly-region-size defaults to 300, with 100 bases of padding on each side, so 500 bases total per region. And --max-reads-per-alignment-start defaults to 50, giving a theoretical maximum of 25,000 reads per region. If we instead assume that the reads are indeed capped at 25k but the haplotypes are not capped, it would require about ~86k haplotypes to produce overflow.

Not sure what might be different about the --linked-de-bruijn-graph codepath that could produce such an explosion of either reads or haplotypes....

lbergelson commented 1 year ago

One again the limits of using natural numbers for array bounds crops up. When will java expand to cover even integer valued arrays? I cant wait for project Euler which introduces complex array bounds as part of the expanded type system.

broadinstitute / gatk

`HaplotypeCaller` in linked DeBruijn mode + pileup mode explodes with `java.lang.NegativeArraySizeException` or exit code `247` (OOM) #8440