broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.69k stars 590 forks source link

Reference isn't being parsed/detected correctly when using CalibrateDragstrModel in parallel mode #8139

Open nvnieuwk opened 1 year ago

nvnieuwk commented 1 year ago

Instructions

The github issue tracker is for bug reports, feature requests, and API documentation requests. General questions about how to use the GATK, how to interpret the output, etc. should be asked on the official support forum.


Bug Report

Affected tool(s) or class(es)

GATK CalibrateDragstrModel

Affected version(s)

Description

When running CalibrateDragstrModel in parallel mode, the supplied reference isn't detected correctly causing the following error stack trace:

Using GATK jar /usr/local/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx72g -jar /usr/local/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar CalibrateDragstrModel --input input.cram --output input.txt --reference hg38.fa --str-table-path hg38.zip --threads 12 --intervals fasta_bed.bed --tmp-dir .
10:24:21.117 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/usr/local/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
10:24:21.289 INFO  CalibrateDragstrModel - ------------------------------------------------------------
10:24:21.289 INFO  CalibrateDragstrModel - The Genome Analysis Toolkit (GATK) v4.3.0.0
10:24:21.289 INFO  CalibrateDragstrModel - For support and documentation go to https://software.broadinstitute.org/gatk/
10:24:21.289 INFO  CalibrateDragstrModel - Executing as nvnieuwk on Linux v4.18.0-372.36.1.el8_6.x86_64 amd64
10:24:21.289 INFO  CalibrateDragstrModel - Java runtime: OpenJDK 64-Bit Server VM v11.0.15-internal+0-adhoc..src
10:24:21.289 INFO  CalibrateDragstrModel - Start Date/Time: January 2, 2023 at 10:24:21 AM GMT
10:24:21.289 INFO  CalibrateDragstrModel - ------------------------------------------------------------
10:24:21.289 INFO  CalibrateDragstrModel - ------------------------------------------------------------
10:24:21.290 INFO  CalibrateDragstrModel - HTSJDK Version: 3.0.1
10:24:21.290 INFO  CalibrateDragstrModel - Picard Version: 2.27.5
10:24:21.290 INFO  CalibrateDragstrModel - Built for Spark Version: 2.4.5
10:24:21.290 INFO  CalibrateDragstrModel - HTSJDK Defaults.COMPRESSION_LEVEL : 2
10:24:21.290 INFO  CalibrateDragstrModel - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
10:24:21.290 INFO  CalibrateDragstrModel - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
10:24:21.290 INFO  CalibrateDragstrModel - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
10:24:21.290 INFO  CalibrateDragstrModel - Deflater: IntelDeflater
10:24:21.290 INFO  CalibrateDragstrModel - Inflater: IntelInflater
10:24:21.291 INFO  CalibrateDragstrModel - GCS max retries/reopens: 20
10:24:21.291 INFO  CalibrateDragstrModel - Requester pays: disabled
10:24:21.291 INFO  CalibrateDragstrModel - Initializing engine
10:24:21.937 INFO  FeatureManager - Using codec BEDCodec to read file file:///kyukon/scratch/gent/vo/000/gvo00082/vsc44804/nxf.4BNO5qL7DM/fasta_bed.bed
10:24:21.962 INFO  IntervalArgumentCollection - Processing 3217346917 bp from intervals
10:24:22.008 INFO  CalibrateDragstrModel - Running in parallel using the requested number of threads: 12
10:24:22.008 INFO  CalibrateDragstrModel - Done initializing engine
10:24:22.008 INFO  ProgressMeter - Starting traversal
10:24:22.008 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Records Processed   Records/Minute
10:24:32.859 INFO  ProgressMeter -        chr1:26000000              0.2                 59038         326477.4
10:24:42.867 INFO  ProgressMeter -        chr1:83000000              0.3                184245         529998.1
10:24:52.965 INFO  ProgressMeter -       chr1:137193529              0.5                306766         594565.4
10:25:03.307 INFO  ProgressMeter -       chr1:193193529              0.7                428759         622924.6
10:25:13.318 INFO  ProgressMeter -         chr2:3237107              0.9                564835         660497.0
10:25:23.358 INFO  ProgressMeter -        chr2:57237107              1.0                681209         666219.1
10:25:33.392 INFO  ProgressMeter -       chr2:109237107              1.2                799610         672091.8
10:25:44.527 INFO  ProgressMeter -       chr2:177512416              1.4                930822         676805.6
10:25:54.821 INFO  ProgressMeter -       chr2:237512416              1.5               1069999         691712.8
10:26:04.863 INFO  ProgressMeter -        chr3:54999378              1.7               1197525         698570.8
10:26:09.642 INFO  CalibrateDragstrModel - Shutting down engine
[January 2, 2023 at 10:26:09 AM GMT] org.broadinstitute.hellbender.tools.dragstr.CalibrateDragstrModel done. Elapsed time: 1.81 minutes.
Runtime.totalMemory()=47647293440
java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: Requested start 8613 is beyond the sequence length HLA-DRB1*04:03:01
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
    at java.base/java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:600)
    at java.base/java.util.concurrent.ForkJoinTask.get(ForkJoinTask.java:1006)
    at org.broadinstitute.hellbender.utils.Utils.runInParallel(Utils.java:1479)
    at org.broadinstitute.hellbender.tools.dragstr.CalibrateDragstrModel.collectCaseStatsParallel(CalibrateDragstrModel.java:551)
    at org.broadinstitute.hellbender.tools.dragstr.CalibrateDragstrModel.traverse(CalibrateDragstrModel.java:202)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1095)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
Caused by: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: Requested start 8613 is beyond the sequence length HLA-DRB1*04:03:01
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
    at java.base/java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:600)
    at java.base/java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:678)
    at java.base/java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:737)
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:919)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
    at java.base/java.util.stream.ReferencePipeline.reduce(ReferencePipeline.java:558)
    at org.broadinstitute.hellbender.tools.dragstr.CalibrateDragstrModel.lambda$collectCaseStatsParallel$14(CalibrateDragstrModel.java:568)
    at java.base/java.util.concurrent.ForkJoinTask$AdaptedCallable.exec(ForkJoinTask.java:1448)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
    at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
Caused by: java.lang.IllegalArgumentException: Requested start 8613 is beyond the sequence length HLA-DRB1*04:03:01
    at htsjdk.samtools.cram.ref.ReferenceSource.getReferenceBasesByRegion(ReferenceSource.java:207)
    at htsjdk.samtools.cram.build.CRAMReferenceRegion.fetchReferenceBasesByRegion(CRAMReferenceRegion.java:169)
    at htsjdk.samtools.cram.structure.Slice.normalizeCRAMRecords(Slice.java:502)
    at htsjdk.samtools.cram.structure.Container.getSAMRecords(Container.java:322)
    at htsjdk.samtools.CRAMIterator.nextContainer(CRAMIterator.java:112)
    at htsjdk.samtools.CRAMIterator.hasNext(CRAMIterator.java:204)
    at htsjdk.samtools.CRAMFileReader$CRAMIntervalIteratorBase.getNextRecord(CRAMFileReader.java:589)
    at htsjdk.samtools.CRAMFileReader$CRAMIntervalIteratorBase.initializeIterator(CRAMFileReader.java:562)
    at htsjdk.samtools.CRAMFileReader$CRAMIntervalIterator.<init>(CRAMFileReader.java:620)
    at htsjdk.samtools.CRAMFileReader$CRAMIntervalIterator.<init>(CRAMFileReader.java:615)
    at htsjdk.samtools.CRAMFileReader.query(CRAMFileReader.java:487)
    at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:550)
    at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.queryOverlapping(SamReader.java:417)
    at org.broadinstitute.hellbender.utils.iterators.SamReaderQueryingIterator.loadNextIterator(SamReaderQueryingIterator.java:130)
    at org.broadinstitute.hellbender.utils.iterators.SamReaderQueryingIterator.<init>(SamReaderQueryingIterator.java:69)
    at org.broadinstitute.hellbender.engine.ReadsPathDataSource.prepareIteratorsForTraversal(ReadsPathDataSource.java:412)
    at org.broadinstitute.hellbender.engine.ReadsPathDataSource.prepareIteratorsForTraversal(ReadsPathDataSource.java:389)
    at org.broadinstitute.hellbender.engine.ReadsPathDataSource.query(ReadsPathDataSource.java:352)
    at org.broadinstitute.hellbender.tools.dragstr.CalibrateDragstrModel.readStream(CalibrateDragstrModel.java:915)
    at org.broadinstitute.hellbender.tools.dragstr.CalibrateDragstrModel.lambda$null$11(CalibrateDragstrModel.java:556)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
    at org.broadinstitute.hellbender.tools.dragstr.InterleavingListSpliterator.forEachRemaining(InterleavingListSpliterator.java:87)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
    at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:952)
    at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:926)
    at java.base/java.util.stream.AbstractTask.compute(AbstractTask.java:327)
    at java.base/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:746)
    ... 5 more

However it does work when running the tool single threaded with the exact same options.

Steps to reproduce

I've sadly been unable to create a reproducible example. I've only encountered this with non-public data which I can't share here. I'd be happy to run tests for you though.

droazen commented 1 year ago

@vruano Your thoughts on this one?

droazen commented 1 year ago

@nvnieuwk Can you check the sequence dictionaries for your reference and cram to see what the length of the "HLA-DRB1*04:03:01" contig is reported as? It should be 15246 for the hg38 reference.

You can check the reference sequence dictionary by searching the .dict file for the contig name, and you can check the cram dictionary by inspecting the entry for that contig in the cram header.

nvnieuwk commented 1 year ago

In the .dict file it shows the correct length:

@SQ SN:HLA-DRB1*04:03:01    LN:15246    UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa  AS:GRCh38   M5:ce0de8afd561fb1fb0d3acce94386a27 SP:Human

And in the cram header it shows the same length:

@SQ SN:HLA-DRB1*04:03:01    LN:15246    AH:*    M5:ce0de8afd561fb1fb0d3acce94386a27 UR:/kyukon/data/gent/shared/000/gvo00082/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa
vruano commented 1 year ago

@droazen make sense to check the .fai file and the fasta itself just in case or is .dict guaratee t be the only source for contig lengths at this point.

vruano commented 1 year ago

Looking at the htsjdk code responsible for the original throw (as far as I can see in the stack enclosed in the description) there is a few "smells" in the way synchronized is used or not use ReferenceSource.java. It is likely to be the reason behind the error considering that is failing in multi-thread.

Probably adding synchronized to getReferenceBasesByRegion would fix that. Is a htsjdk issue and not a GATK one. Do you want to add a workaround in GATK or press for a fix and update of the htsjdk dependency. @droazen?

vruano commented 1 year ago

more concretely the private method getReferenceBases(SAMSeqRecord) should be syncronized or avoid it calling directly to the syncronized getReferenceBases(SSR, boolean) and getReferenceBasesByRegions should not update the cache fields.

nvnieuwk commented 1 year ago

Hi, any news on this? :)