broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

java.lang.ArrayIndexOutOfBoundsException when creating tabix index #7838

Open bw2 opened 2 years ago

bw2 commented 2 years ago

Bug Report

Affected tool(s) or class(es)

gatk SortVcf

Affected version(s)

Mac OS X 10.16 x86_64; OpenJDK 64-Bit Server VM 1.8.0_322-b06; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.1.4.1

Description

SortVcf finishes sorting and writes out a VCF, but then fails with java.lang.ArrayIndexOutOfBoundsException when generating the tabix index. To work around this, I can run with --CREATE_INDEX false and then run tabix to generate the index.

INFO    2022-05-06 12:14:45 SortVcf wrote       675,000 records.  Elapsed time: 00:00:03s.  Time for last 25,000:    0s.  Last read position: chr3:41,521,469
INFO    2022-05-06 12:14:45 SortVcf wrote       700,000 records.  Elapsed time: 00:00:03s.  Time for last 25,000:    0s.  Last read position: chr3:61,833,861
INFO    2022-05-06 12:14:45 SortVcf wrote       725,000 records.  Elapsed time: 00:00:03s.  Time for last 25,000:    0s.  Last read position: chr3:78,534,676
INFO    2022-05-06 12:14:45 SortVcf wrote       750,000 records.  Elapsed time: 00:00:03s.  Time for last 25,000:    0s.  Last read position: chr3:100,707,682
INFO    2022-05-06 12:14:45 SortVcf wrote       775,000 records.  Elapsed time: 00:00:03s.  Time for last 25,000:    0s.  Last read position: chr3:117,527,190
INFO    2022-05-06 12:14:45 SortVcf wrote       800,000 records.  Elapsed time: 00:00:03s.  Time for last 25,000:    0s.  Last read position: chr3:134,613,380
INFO    2022-05-06 12:14:45 SortVcf wrote       825,000 records.  Elapsed time: 00:00:03s.  Time for last 25,000:    0s.  Last read position: chr3:153,780,108
INFO    2022-05-06 12:14:45 SortVcf wrote       850,000 records.  Elapsed time: 00:00:03s.  Time for last 25,000:    0s.  Last read position: chr3:173,329,831
INFO    2022-05-06 12:14:46 SortVcf wrote       875,000 records.  Elapsed time: 00:00:03s.  Time for last 25,000:    0s.  Last read position: chr3:192,133,262
[Fri May 06 12:14:46 EDT 2022] picard.vcf.SortVcf done. Elapsed time: 0.36 minutes.
Runtime.totalMemory()=2855272448
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp

java.lang.ArrayIndexOutOfBoundsException: 16799
    at htsjdk.samtools.BinningIndexBuilder.processFeature(BinningIndexBuilder.java:102)
    at htsjdk.tribble.index.tabix.TabixIndexCreator.finalizeFeature(TabixIndexCreator.java:106)
    at htsjdk.tribble.index.tabix.TabixIndexCreator.addFeature(TabixIndexCreator.java:92)
    at htsjdk.variant.variantcontext.writer.IndexingVariantContextWriter.add(IndexingVariantContextWriter.java:203)
    at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:242)
    at picard.vcf.SortVcf.writeSortedOutput(SortVcf.java:183)
    at picard.vcf.SortVcf.doWork(SortVcf.java:101)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
    at org.broadinstitute.hellbender.Main.main(Main.java:292)

Expected output

There's almost certainly some format issue with my VCF, but ideally GATK would have a better error message than ArrayIndexOutOfBoundsException.

lbergelson commented 2 years ago

@bw2 I agree, this is an unhelpful error. We should fix it, but the fix probably has to be made in htsjdk (or Picard, since this is a Picard tool we import).

I'm not 100% sure what the issue is; it seems like we're somehow resolving an invalid bin in the index. I would expect that might happen with a very long chromosome, but 193,000,000 shouldn't be too large. Are you using non-human data or something with an extremely long variant?
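
To make the "invalid bin" idea concrete, here is a minimal Java sketch of the standard binning scheme from the SAM/tabix specifications (`min_shift = 14`, `depth = 5`). The class and method names are hypothetical and are not htsjdk's. Two things are assumptions rather than facts from this thread: that the `16799` in the exception is a bin number, and that the index builder sizes its per-contig bin array from the contig length declared in the VCF header. The GRCh38 chr3 length is used only as an example, since the reference build was not stated.

```java
// Sketch of the tabix/BAI binning scheme; names are illustrative, not htsjdk's API.
final class TabixBinSketch {

    // First bin number of the finest level (16 kb windows): ((1 << 15) - 1) / 7.
    static final int FIRST_FINEST_BIN = 4681;

    /** Standard reg2bin: smallest bin containing the 0-based, half-open interval [beg, end). */
    static int regionToBin(int beg, int end) {
        --end;
        if (beg >> 14 == end >> 14) return ((1 << 15) - 1) / 7 + (beg >> 14);
        if (beg >> 17 == end >> 17) return ((1 << 12) - 1) / 7 + (beg >> 17);
        if (beg >> 20 == end >> 20) return ((1 << 9) - 1) / 7 + (beg >> 20);
        if (beg >> 23 == end >> 23) return ((1 << 6) - 1) / 7 + (beg >> 23);
        if (beg >> 26 == end >> 26) return ((1 << 3) - 1) / 7 + (beg >> 26);
        return 0;
    }

    /** Largest bin number that a feature lying inside a contig of this length can produce. */
    static int maxBinForLength(int sequenceLength) {
        return FIRST_FINEST_BIN + ((sequenceLength - 1) >> 14);
    }

    /** 0-based start of the 16 kb window covered by a finest-level bin (bin >= 4681). */
    static long finestBinStart(int bin) {
        return (long) (bin - FIRST_FINEST_BIN) << 14;
    }

    public static void main(String[] args) {
        int binFromException = 16799;   // ASSUMPTION: the exception index is a bin number
        int assumedChr3Length = 198_295_559; // GRCh38 chr3; the reporter's build is unknown

        // Window start implied by the bin: (16799 - 4681) << 14 = 198,541,312
        System.out.println("window start: " + finestBinStart(binFromException));
        // Largest in-bounds bin for the assumed contig length: 16784
        System.out.println("max bin:      " + maxBinForLength(assumedChr3Length));
    }
}
```

Under those assumptions, bin 16799 corresponds to a 16 kb window starting at 198,541,312, which is beyond both the last logged position (chr3:192,133,262) and the end of GRCh38 chr3. That would be consistent with a record whose coordinates run past the contig length declared in the header, but it is only a hypothesis until a reproducing file is available.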

bw2 commented 2 years ago

Yes, this was human data. It might have been a long variant.

droazen commented 2 years ago

@bw2 Do you have a small file that reproduces this issue? We'll need a runnable test case that reproduces this in order to debug further.
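
One way to hunt for a small reproducing file is to pull out records whose POS or INFO `END` lies past the contig length declared in the header. This is only a guess at the cause, not something confirmed in this thread. The sketch below is plain Java with no htsjdk dependency; the class and method names are hypothetical, and it only handles uncompressed, tab-delimited VCF text with `##contig` header lines that carry a `length` field.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: lists VCF records that extend past their declared contig length.
final class VcfBoundsCheckSketch {
    private static final Pattern CONTIG_ID = Pattern.compile("[<,]ID=([^,>]+)");
    private static final Pattern CONTIG_LEN = Pattern.compile("[<,]length=(\\d+)");
    private static final Pattern INFO_END = Pattern.compile("(?:^|;)END=(\\d+)");

    static List<String> findOutOfBounds(BufferedReader vcf) throws IOException {
        Map<String, Long> lengths = new HashMap<>();
        List<String> suspicious = new ArrayList<>();
        String line;
        while ((line = vcf.readLine()) != null) {
            if (line.startsWith("##contig=")) {
                // Collect declared contig lengths from the header.
                Matcher id = CONTIG_ID.matcher(line);
                Matcher len = CONTIG_LEN.matcher(line);
                if (id.find() && len.find()) {
                    lengths.put(id.group(1), Long.parseLong(len.group(1)));
                }
                continue;
            }
            if (line.isEmpty() || line.startsWith("#")) continue;

            String[] fields = line.split("\t", 9);
            if (fields.length < 8) continue;              // not a complete record
            Long contigLength = lengths.get(fields[0]);
            if (contigLength == null) continue;           // no declared length to compare against

            long pos = Long.parseLong(fields[1]);
            long end = pos + fields[3].length() - 1;      // end implied by the REF allele
            Matcher m = INFO_END.matcher(fields[7]);
            if (m.find()) end = Math.max(end, Long.parseLong(m.group(1)));

            if (pos > contigLength || end > contigLength) suspicious.add(line);
        }
        return suspicious;
    }
}
```

Any records it returns, together with the original header, would make a compact candidate test case to attach to the issue.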

cwhelan commented 2 years ago

I'm not sure if it'll fix or affect this issue, but I noticed this and want to note that @tedsharpe has an active pull request to fix issues with tabix index generation: https://github.com/broadinstitute/gatk/pull/7858