broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.69k stars 589 forks source link

java.lang.ArrayIndexOutOfBoundsException: Index 32770 out of bounds for length 32770 #8192

Closed shuaiwang2 closed 1 year ago

shuaiwang2 commented 1 year ago

Hello, When I implement "HaplotypeCaller" commands, the reference genome is about 15G , every chromosome is more then 600M, I get some errors, could you give me some advice? the commands

# the step is after marked duplication 
samtools index -c  markdup.bam.gz
gatk --java-options "-Xmx100G -Djava.io.tmpdir=./" HaplotypeCaller -R Triticum_aestivum.IWGSC.dna.toplevel.fa -I rmarkdup.bam.gz -O SRR9851087.gvcf.gz   -ERC GVCF  -OVI >3gvcf.log 2>&1

the bug:

Using GATK jar /home/ywt/anaconda3/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx100G -Djava.io.tmpdir=./ -jar /home/ywt/anaconda3/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar HaplotypeCaller -R Triticum_aestivum.IWGSC.dna.
toplevel.fa -I rmarkdup.bam.gz -O SRR9851087.gvcf.gz -ERC GVCF -OVI
09:01:25.845 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/ywt/anaconda3/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
09:01:25.950 INFO  HaplotypeCaller - ------------------------------------------------------------
09:01:25.950 INFO  HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.3.0.0
09:01:25.950 INFO  HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
09:01:25.950 INFO  HaplotypeCaller - Executing as ywt@ywt-Precision-5820-Tower on Linux v5.15.0-41-generic amd64
09:01:25.950 INFO  HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v17.0.5+8-Ubuntu-2ubuntu122.04
09:01:25.950 INFO  HaplotypeCaller - Start Date/Time: February 8, 2023 at 9:01:25 AM CST
09:01:25.950 INFO  HaplotypeCaller - ------------------------------------------------------------
09:01:25.950 INFO  HaplotypeCaller - ------------------------------------------------------------
09:01:25.951 INFO  HaplotypeCaller - HTSJDK Version: 3.0.1
09:01:25.951 INFO  HaplotypeCaller - Picard Version: 2.27.5
09:01:25.951 INFO  HaplotypeCaller - Built for Spark Version: 2.4.5
09:01:25.951 INFO  HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
09:01:25.951 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
09:01:25.951 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
09:01:25.951 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
09:01:25.951 INFO  HaplotypeCaller - Deflater: IntelDeflater
09:01:25.951 INFO  HaplotypeCaller - Inflater: IntelInflater
09:01:25.951 INFO  HaplotypeCaller - GCS max retries/reopens: 20
09:01:25.951 INFO  HaplotypeCaller - Requester pays: disabled
09:01:25.952 INFO  HaplotypeCaller - Initializing engine
09:01:26.059 INFO  HaplotypeCaller - Done initializing engine
09:01:26.060 INFO  HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
09:01:26.067 INFO  HaplotypeCallerEngine - Standard Emitting and Calling confidence set to -0.0 for reference-model confidence output
09:01:26.067 INFO  HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
09:01:26.077 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/home/ywt/anaconda3/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
09:01:26.078 INFO  NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/home/ywt/anaconda3/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
09:01:26.089 INFO  IntelPairHmm - Using CPU-supported AVX-512 instructions
09:01:26.089 INFO  IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
09:01:26.090 INFO  IntelPairHmm - Available threads: 36
09:01:26.090 INFO  IntelPairHmm - Requested threads: 4
09:01:26.090 INFO  PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
09:01:26.121 INFO  ProgressMeter - Starting traversal
09:01:26.121 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Regions Processed   Regions/Minute
09:01:26.406 WARN  InbreedingCoeff - InbreedingCoeff will not be calculated at position 1A:145 and possibly subsequent; at least 10 samples must have called genotypes
09:01:33.373 WARN  DepthPerSampleHC - Annotation will not be calculated at position 1A:1702502 and possibly subsequent; genotype for sample SRR9851087 is not called
09:01:33.374 WARN  StrandBiasBySample - Annotation will not be calculated at position 1A:1702502 and possibly subsequent; genotype for sample SRR9851087 is not called
09:01:36.316 INFO  ProgressMeter -           1A:2054431              0.2                  7310          43025.3
09:01:46.831 INFO  ProgressMeter -           1A:3580946              0.3                 12960          37547.1
09:01:56.858 INFO  ProgressMeter -           1A:4888859              0.5                 17840          34824.5
09:02:07.416 INFO  ProgressMeter -           1A:7184455              0.7                 26090          37907.7
09:02:17.850 INFO  ProgressMeter -           1A:9469826              0.9                 34580          40109.0
09:02:28.162 INFO  ProgressMeter -          1A:11632942              1.0                 42480          41082.5
09:02:38.391 INFO  ProgressMeter -          1A:12861813              1.2                 47220          39203.0
09:02:48.460 INFO  ProgressMeter -          1A:15373965              1.4                 56590          41236.8

09:20:43.520 INFO  ProgressMeter -         1A:536836177             19.3               1951140         101147.8
09:20:44.715 INFO  HaplotypeCaller - Shutting down engine
[February 8, 2023 at 9:20:44 AM CST] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 19.31 minutes.
Runtime.totalMemory()=1811939328
java.lang.ArrayIndexOutOfBoundsException: Index 32770 out of bounds for length 32770
        at htsjdk.samtools.BinningIndexBuilder.processFeature(BinningIndexBuilder.java:142)
        at htsjdk.tribble.index.tabix.TabixIndexCreator.finalizeFeature(TabixIndexCreator.java:106)
        at htsjdk.tribble.index.tabix.TabixIndexCreator.finalizeIndex(TabixIndexCreator.java:129)
        at htsjdk.variant.variantcontext.writer.IndexingVariantContextWriter.close(IndexingVariantContextWriter.java:177)
        at htsjdk.variant.variantcontext.writer.VCFWriter.close(VCFWriter.java:233)
        at org.broadinstitute.hellbender.utils.variant.writers.GVCFWriter.close(GVCFWriter.java:71)
        at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller.closeTool(HaplotypeCaller.java:277)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1101)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)
(END)
lbergelson commented 1 year ago

@shuaiwang2 Hi, we don't currently support indexes that long. We use a bai index for bams and tabix for vcf which only support up to 512 M. You need to use a CSI index for references that large but we don't support writing those. (Reading them is weird, I think we can read BAM csi indexes but not VCF ones).

It might be possible to work around this issue by setting --create-output-variant-index false, although downstream gatk tools would need an index if you're sharding them.

Otherwise I recommend splitting your chromosomes into two separate parts and calling on the split chromosomes. Splitting along a long region of N's should be a safe way to avoid missing any useful calls. (The telemere might be a good spot unless you have a T2T reference.).

We should probably improve that error message to make it clear what the problem is.

shuaiwang2 commented 1 year ago

thank you, very useful advice for me

lbergelson commented 1 year ago

I've opened a ticket (https://github.com/samtools/htsjdk/issues/1651) to improve this error message and make it less confusing