broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

How to avoid java.lang.ArrayIndexOutOfBoundsException when indexing a vcf.gz file? #8747

Open erah1 opened 8 months ago

erah1 commented 8 months ago

Hello,

Could you help me with this? I ran this code:

prg=/home/user1/Programs/gatk-4.5.0.0
log_dir=/home/user1/Programs/logs
java -Xmx64g -XX:ParallelGCThreads=2 -DGATK_STACKTRACE_ON_USER_EXCEPTION=true \
     -jar ${prg}/gatk-package-4.5.0.0-local.jar IndexFeatureFile -I ${dir}/snp_allsamples.vcf.gz \
     --output snp_allsamples.vcf.tbi \
     2>${log_dir}/snp_allsamples_gvcf_index.err

and I received the following error message:

09:36:35.254 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/user1/Programs/gatk-4.5.0.0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
09:36:35.386 INFO  IndexFeatureFile - ------------------------------------------------------------
09:36:35.389 INFO  IndexFeatureFile - The Genome Analysis Toolkit (GATK) v4.5.0.0
09:36:35.389 INFO  IndexFeatureFile - For support and documentation go to https://software.broadinstitute.org/gatk/
09:36:35.389 INFO  IndexFeatureFile - Executing as user1@xxx.xx on Linux v5.4.0-150-generic amd64
09:36:35.389 INFO  IndexFeatureFile - Java runtime: OpenJDK 64-Bit Server VM v17.0.3-internal+0-adhoc..src
09:36:35.389 INFO  IndexFeatureFile - Start Date/Time: March 21, 2024 at 9:36:35 a.m. CST
09:36:35.390 INFO  IndexFeatureFile - ------------------------------------------------------------
09:36:35.390 INFO  IndexFeatureFile - ------------------------------------------------------------
09:36:35.390 INFO  IndexFeatureFile - HTSJDK Version: 4.1.0
09:36:35.391 INFO  IndexFeatureFile - Picard Version: 3.1.1
09:36:35.391 INFO  IndexFeatureFile - Built for Spark Version: 3.5.0
09:36:35.391 INFO  IndexFeatureFile - HTSJDK Defaults.COMPRESSION_LEVEL : 2
09:36:35.391 INFO  IndexFeatureFile - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
09:36:35.392 INFO  IndexFeatureFile - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
09:36:35.392 INFO  IndexFeatureFile - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
09:36:35.392 INFO  IndexFeatureFile - Deflater: IntelDeflater
09:36:35.392 INFO  IndexFeatureFile - Inflater: IntelInflater
09:36:35.392 INFO  IndexFeatureFile - GCS max retries/reopens: 20
09:36:35.392 INFO  IndexFeatureFile - Requester pays: disabled
09:36:35.393 INFO  IndexFeatureFile - Initializing engine
09:36:35.393 INFO  IndexFeatureFile - Done initializing engine
09:36:35.502 INFO  FeatureManager - Using codec VCFCodec to read file file:///mnt/user1/snp_allsamples.vcf.gz
09:36:35.518 INFO  ProgressMeter - Starting traversal
09:36:35.518 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Records Processed   Records/Minute
09:36:36.979 INFO  IndexFeatureFile - Shutting down engine
[March 21, 2024 at 9:36:36 a.m. CST] org.broadinstitute.hellbender.tools.IndexFeatureFile done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=1241513984
java.lang.ArrayIndexOutOfBoundsException: Index 37451 out of bounds for length 37451
        at htsjdk.samtools.BinningIndexBuilder.processFeature(BinningIndexBuilder.java:102)
        at htsjdk.tribble.index.tabix.TabixIndexCreator.finalizeFeature(TabixIndexCreator.java:106)
        at htsjdk.tribble.index.tabix.TabixIndexCreator.addFeature(TabixIndexCreator.java:92)
        at htsjdk.tribble.index.IndexFactory.createIndex(IndexFactory.java:529)
        at htsjdk.tribble.index.IndexFactory.createTabixIndex(IndexFactory.java:476)
        at htsjdk.tribble.index.IndexFactory.createTabixIndex(IndexFactory.java:502)
        at htsjdk.tribble.index.IndexFactory.createIndex(IndexFactory.java:403)
        at org.broadinstitute.hellbender.tools.IndexFeatureFile.createAppropriateIndexInMemory(IndexFeatureFile.java:109)
        at org.broadinstitute.hellbender.tools.IndexFeatureFile.doWork(IndexFeatureFile.java:75)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:149)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:198)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:217)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:166)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:209)
        at org.broadinstitute.hellbender.Main.main(Main.java:306)

Thank you

evanizer8 commented 2 months ago

This is probably too late to be of any help, but I had the exact same issue, down to the index it prints out as problematic. Maybe others will stumble upon this and find the issue here as I have. I found some pertinent info here: https://gatk.broadinstitute.org/hc/en-us/community/posts/12862204385051-Is-it-feasible-to-use-the-extracted-vcf-gz-file-for-CombineGVCFs-and-GenotypeGVCFs

It seems they never got around to adding a more useful error message, though. Anyway, I did as advised and split the chromosomes (I'm working with barley, and the sequence lengths are > 2^29 bp, which is the largest coordinate a tabix index can address).

BUT when I try indexing the bgzipped second "halves" of each chromosome with IndexFeatureFile, I get the same message again! When they're not bgzipped, however, it actually works.
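
For anyone who wants to check in advance whether a VCF is likely to hit this, here is a minimal sketch that lists contigs whose declared length is above the tabix (.tbi) coordinate limit of 2^29 bp. The function name `contigs_over_tbi_limit` is made up for illustration, and the sketch assumes the header carries `##contig=<ID=...,length=...>` lines; a header without lengths will simply produce no output.

```shell
#!/usr/bin/env bash
# Tabix (.tbi) indexes can only address coordinates up to 2^29 (536,870,912 bp).
TBI_MAX=$((1 << 29))

# Hypothetical helper: read VCF text on stdin and print every contig whose
# declared length exceeds the .tbi limit, as "<name><TAB><length>".
# Assumes ##contig header lines of the form <ID=...,length=...>.
contigs_over_tbi_limit() {
    awk -v max="$TBI_MAX" '
        /^##contig=/ {
            id = ""; len = "0"
            if (match($0, /ID=[^,>]+/))     id  = substr($0, RSTART + 3, RLENGTH - 3)
            if (match($0, /length=[0-9]+/)) len = substr($0, RSTART + 7, RLENGTH - 7)
            # Compare numerically, but print the length exactly as written.
            if (len + 0 > max) printf "%s\t%s\n", id, len
        }
        /^#CHROM/ { exit }   # header ends here; no need to scan the records
    '
}

# Usage (file name taken from the original post):
#   gzip -dc snp_allsamples.vcf.gz | contigs_over_tbi_limit
```

Any contig it prints is too long for a .tbi index. Note that this only inspects the header: if the split "halves" keep their original coordinates, records past 2^29 would still be present, which would be consistent with the error coming back, though that is a guess and not something confirmed in this thread. htslib can also write a CSI index (`tabix --csi`), which does not have this limit, but whether a given GATK version will read it needs checking.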