broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

GenomicsDBImport not completing for mixed ploidy samples #6275

Closed bshifaw closed 4 years ago

bshifaw commented 4 years ago

I was able to replicate the user's error with GATK 4.1.1.0 and the latest build as of Nov 21. The user's data is on the FTP as livinlrg_Problem_Interval.

Running SelectVariants on the same data generates the same error as GenomicsDBImport:

/home/tools/gatk/gatk SelectVariants --java-options "-Xmx20g" -R /home/test/livinlrg_Problem_Interval/Reference/sacCer3_2micron.fasta -V gendb:///home/test/livinlrg_Problem_Interval/GenomicsDB_ProblemInterval_Test -O /home/test/livinlrg_Problem_Interval/work_dir/selectvariantsout.vcf

WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
14:28:46.814 INFO  ProgressMeter -            chrI:2000              0.2                  2000          10258.2
14:29:00.359 INFO  ProgressMeter -            chrI:6003              0.4                  6000          14261.4
14:29:28.258 INFO  SelectVariants - Shutting down engine
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),37.44650732699998,Cpu time(s),37.414083634000015
[November 21, 2019 2:29:29 PM UTC] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 0.92 minutes.
Runtime.totalMemory()=1783103488
htsjdk.tribble.TribbleException: Invalid block size -1539959833
        at htsjdk.variant.bcf2.BCF2Decoder.readNextBlock(BCF2Decoder.java:66)
        at htsjdk.variant.bcf2.BCF2Codec.decode(BCF2Codec.java:134)
        at htsjdk.variant.bcf2.BCF2Codec.decode(BCF2Codec.java:58)
        at org.genomicsdb.reader.GenomicsDBFeatureIterator.next(GenomicsDBFeatureIterator.java:181)
        at org.genomicsdb.reader.GenomicsDBFeatureIterator.next(GenomicsDBFeatureIterator.java:49)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
        at org.broadinstitute.hellbender.engine.VariantWalker.traverse(VariantWalker.java:102)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1048)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
        at org.broadinstitute.hellbender.Main.main(Main.java:292)

User Question (LINK)


I'm attempting to call variants on whole genomes for about 500 Illumina paired-end samples with varying ploidy (haploid to tetraploid). I'm running a fairly standard uBAM-to-GVCF pipeline, with HaplotypeCaller given the ploidy information (1, 2, 3, or 4) and run in -ERC GVCF mode. I then collect the GVCFs using GenomicsDBImport with a batch size of 50 and run GenotypeGVCFs on the combined database. The interval list passed to GenomicsDBImport is just each chromosome on a separate line. I'm using GATK v4.1.1.0.
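For reference, the per-sample calling step described above would look roughly like the sketch below; the sample name, memory setting, and ploidy value are placeholders rather than details from the original post:

```
${GATK_DIR}/gatk HaplotypeCaller \
        --java-options "-Xmx8g" \
        -R ${REF} \
        -I sample1.bam \
        -O sample1.g.vcf.gz \
        --sample-ploidy 2 \
        -ERC GVCF
```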

Command:


${GATK_DIR}/gatk GenomicsDBImport \
        --java-options "-Xmx110g -Xms110g" \
        -R ${REF} \
        --variant ${FILE_LIST} \
        -L ${SCRIPT_DIR}/GATK_Style_Interval.list \
        --genomicsdb-workspace-path ${WORK_DIR}/GenomicsDB_20190912 \
        --batch-size 50 \
        --tmp-dir=${WORK_DIR}/

GenomicsDBImport appears to run without error, but it only shows progress for the first 6000 bp before moving on to the next batch. When I run SelectVariants on the created database, I only get variants up to position 6716 in the first interval. When I try to run GenotypeGVCFs on it, I get a strange error:

htsjdk.tribble.TribbleException: Invalid block size -1570639203

My first assumption was that one of the GVCFs is malformed because HaplotypeCaller failed after the first 6000 bp, but I've verified that the GVCFs have all completed and have 'validated' them with ValidateVariants using GATK v4.1.3.0. When I grep for the particular position in the samples' GVCFs, I don't find anything out of the ordinary. I would use CombineGVCFs, but it fails when trying to combine mixed ploidies.

Any ideas on troubleshooting, or experience with problems like this?
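For instance, the per-sample validation and position check described above can be done roughly as follows (the sample name and coordinates are illustrative; --validate-GVCFs is the GVCF-mode flag of ValidateVariants in GATK 4.1.x):

```
# Validate one of the input GVCFs in GVCF mode
${GATK_DIR}/gatk ValidateVariants \
        -R ${REF} \
        -V sample1.g.vcf.gz \
        --validate-GVCFs

# Inspect records around the position where the import appears to stop
zcat sample1.g.vcf.gz | awk '$1 == "chrI" && $2 >= 6700 && $2 <= 6730'
```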

This Issue was generated from your [forums] post.
[forums]: https://gatkforums.broadinstitute.org/gatk/discussion/24446/genomicsdbimport-not-completing-for-mixed-ploidy-samples/p1
TedBrookings commented 4 years ago

I'm going to tag in our Genomics DB experts to take a look at this: @nalinigans @mlathara

TedBrookings commented 4 years ago

Link to user data: https://slack-files.com/T0CMFS7GX-FQW79RU06-03e378d2e9

mlathara commented 4 years ago

Thanks for the files. We've identified the issue as a bug in htslib; the fix is pretty straightforward. The mixed ploidy is not itself the problem in this case, but the confluence of mixed ploidy, the number of samples, and the number of alternate alleles causes some locations in the combined GVCF to have very large arrays of FORMAT data.

That triggers an (unforeseen) overflow in htslib.
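(For context on why those FORMAT arrays get so large: per the VCF specification, the number of genotypes, and hence PL entries, at a site with A alleles and ploidy P is C(A+P-1, P). A diploid sample at a site with 6 alternate alleles has C(8,2) = 28 PL values, whereas a tetraploid sample at the same site has C(10,4) = 210, and these per-sample arrays are concatenated across every sample in the combined record. The negative "Invalid block size" values in the stack traces above are what such an overflow looks like once the encoded block length wraps around.)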

With Thanksgiving coming up, we won't be able to get the fix out in a released GenomicsDB this week. We can try for early next week, depending on when you're next doing a GATK release @droazen.

marcopessoa commented 4 years ago

Has this been fixed in 4.1.4.1, or should I wait for the next release? I had the same issue with GenotypeGVCFs in 4.1.4.0 using a GenomicsDB database built from five GVCFs of pooled samples (12 samples per pool).

marcopessoa commented 4 years ago

Not fixed in 4.1.4.1. Same issue.

nalinigans commented 4 years ago

@marcopessoa, the fixes for this issue are not in master yet - here is the pending PR #6305. You could use the genomicsdb_120_1 branch to test out your scenario and post what you find to provide further validation for the changes in the PR. Thanks.
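For anyone following along, checking out and building that branch looks roughly like this (the checkout and ./gradlew installDist steps are the ones mentioned later in this thread; running the built toolkit via the repository-root wrapper script follows the standard GATK build docs):

```
git clone https://github.com/broadinstitute/gatk.git
cd gatk
git checkout genomicsdb_120_1
./gradlew installDist
# then invoke the freshly built toolkit via the gatk wrapper script in the repository root
```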

marcopessoa commented 4 years ago

I downloaded and built genomicsdb_120_1 from source. I deleted my previous GenomicsDB database, and built a new one with the following options:

gatk --java-options "-Xmx128g -Xms128g" GenomicsDBImport -R genome.fasta -V input1.gvcf.gz -V input2.gvcf.gz -V input3.gvcf.gz -V input4.gvcf.gz -V input5.gvcf.gz --genomicsdb-workspace-path my_database --tmp-dir /tmp -L chrom1 -L chrom2 -L chrom3 -L chrom4 -L chrom5 -L chrom6 -L chrom7 -L chrom8 -L chrom9 --verbosity DEBUG --max-num-intervals-to-import-in-parallel 9

I then ran GenotypeGVCFs with the following options for one of the chromosomes:

gatk --java-options "-Xmx16g" GenotypeGVCFs -R genome.fasta -V gendb://my_database -O allpools.chrom2.vcf.gz --tmp-dir=/tmp --sample-ploidy 24 -L chrom2

It failed at the same region it was failing before, with this error message:

01:15:27.623 INFO  GenotypeGVCFs - Shutting down engine
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),8476.664214527651,Cpu time(s),8391.206707930733
[January 14, 2020 1:15:30 AM BRT] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 279.78 minutes.
Runtime.totalMemory()=16865820672
htsjdk.tribble.TribbleException: Invalid block size -122708061
        at htsjdk.variant.bcf2.BCF2Decoder.readNextBlock(BCF2Decoder.java:66)
        at htsjdk.variant.bcf2.BCF2Codec.decode(BCF2Codec.java:134)
        at htsjdk.variant.bcf2.BCF2Codec.decode(BCF2Codec.java:58)
        at org.genomicsdb.reader.GenomicsDBFeatureIterator.next(GenomicsDBFeatureIterator.java:181)
        at org.genomicsdb.reader.GenomicsDBFeatureIterator.next(GenomicsDBFeatureIterator.java:49)
        at org.broadinstitute.hellbender.engine.FeatureIntervalIterator.loadNextFeature(FeatureIntervalIterator.java:98)
        at org.broadinstitute.hellbender.engine.FeatureIntervalIterator.loadNextNovelFeature(FeatureIntervalIterator.java:74)
        at org.broadinstitute.hellbender.engine.FeatureIntervalIterator.next(FeatureIntervalIterator.java:62)
        at org.broadinstitute.hellbender.engine.FeatureIntervalIterator.next(FeatureIntervalIterator.java:24)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.forEachOrdered(ReferencePipeline.java:490)
        at org.broadinstitute.hellbender.engine.VariantLocusWalker.traverse(VariantLocusWalker.java:132)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1048)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
        at org.broadinstitute.hellbender.Main.main(Main.java:292)

HaplotypeCaller ran without errors, so I don't know if I should generate new GVCFs.

Each input GVCF is a pool of 12 diploid plants.

These are the INFO lines when I run GenomicsDBImport:

INFO  GenomicsDBImport - ------------------------------------------------------------
INFO  GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.1.4.1-30-g65f5772-SNAPSHOT
INFO  GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
INFO  GenomicsDBImport - Executing as marco.pessoa@baru.local on Linux v3.10.0-1062.4.1.el7.x86_64 amd64
INFO  GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_232-b09
INFO  GenomicsDBImport - Start Date/Time: January 14, 2020 7:41:33 AM BRT
INFO  GenomicsDBImport - ------------------------------------------------------------
INFO  GenomicsDBImport - ------------------------------------------------------------
INFO  GenomicsDBImport - HTSJDK Version: 2.21.1
INFO  GenomicsDBImport - Picard Version: 2.21.4
INFO  GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
INFO  GenomicsDBImport - Deflater: IntelDeflater
INFO  GenomicsDBImport - Inflater: IntelInflater
INFO  GenomicsDBImport - GCS max retries/reopens: 20
INFO  GenomicsDBImport - Requester pays: disabled
INFO  GenomicsDBImport - Initializing engine
INFO  IntervalArgumentCollection - Processing 567075633 bp from intervals
INFO  GenomicsDBImport - Done initializing engine

And GenotypeGVCFs:

INFO  GenotypeGVCFs - ------------------------------------------------------------
INFO  GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.1.4.1-30-g65f5772-SNAPSHOT
INFO  GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
INFO  GenotypeGVCFs - Executing as marco.pessoa@baru.local on Linux v3.10.0-1062.4.1.el7.x86_64 amd64
INFO  GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_232-b09
INFO  GenotypeGVCFs - Start Date/Time: January 14, 2020 7:38:08 AM BRT
INFO  GenotypeGVCFs - ------------------------------------------------------------
INFO  GenotypeGVCFs - ------------------------------------------------------------
INFO  GenotypeGVCFs - HTSJDK Version: 2.21.1
INFO  GenotypeGVCFs - Picard Version: 2.21.4
INFO  GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
INFO  GenotypeGVCFs - Deflater: IntelDeflater
INFO  GenotypeGVCFs - Inflater: IntelInflater
INFO  GenotypeGVCFs - GCS max retries/reopens: 20
INFO  GenotypeGVCFs - Requester pays: disabled
INFO  GenotypeGVCFs - Initializing engine
kgururaj commented 4 years ago

Based on the messages you have provided, it looks like you have built and run the GATK master branch. You should switch to the genomicsdb_120_1 branch (git checkout genomicsdb_120_1) and retry; you don't need to delete the imported workspace. You should see the version string 4.1.4.1-7-gdb054ab-SNAPSHOT in the GATK stdout.

marcopessoa commented 4 years ago

Apologies! I switched to the correct branch. I tested it with a troublesome region I identified in one of the chromosomes, using the following command:

gatk --java-options "-Xmx8g" GenotypeGVCFs -R genome.fasta -V gendb://test_database -O gttest.chrom2.vcf.gz --tmp-dir=/db --sample-ploidy 24 -L chrom2:4320000-4400000 --only-output-calls-starting-in-intervals 

Run info:

INFO  GenotypeGVCFs - ------------------------------------------------------------
INFO  GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.1.4.1-7-gdb054ab-SNAPSHOT
INFO  GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
INFO  GenotypeGVCFs - Executing as marco.pessoa@baru.local on Linux v3.10.0-1062.4.1.el7.x86_64 amd64
INFO  GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_232-b09
INFO  GenotypeGVCFs - Start Date/Time: January 14, 2020 1:53:14 PM BRT
INFO  GenotypeGVCFs - ------------------------------------------------------------
INFO  GenotypeGVCFs - ------------------------------------------------------------
INFO  GenotypeGVCFs - HTSJDK Version: 2.21.0
INFO  GenotypeGVCFs - Picard Version: 2.21.2
INFO  GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
INFO  GenotypeGVCFs - Deflater: IntelDeflater
INFO  GenotypeGVCFs - Inflater: IntelInflater
INFO  GenotypeGVCFs - GCS max retries/reopens: 20
INFO  GenotypeGVCFs - Requester pays: disabled
INFO  GenotypeGVCFs - Initializing engine

The run starts and a few variants are detected. Then it gets stuck in a region for about 30 minutes and prints an out-of-memory error:

INFO  ProgressMeter - chrom2:4323711              0.3                  2000           7859.6
INFO  ProgressMeter - chrom2:4325583              0.6                  3000           4753.5
INFO  ProgressMeter - chrom2:4327262              0.8                  4000           5010.5
INFO  ProgressMeter - chrom2:4333146              1.1                  7000           6493.1
INFO  GenotypeGVCFs - Shutting down engine
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
        at java.lang.StringBuilder.append(StringBuilder.java:136)
        at htsjdk.tribble.util.ParsingUtils.split(ParsingUtils.java:266)
        at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:375)
        at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:328)
        at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:48)
        at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:70)
        at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:37)
        at org.genomicsdb.reader.GenomicsDBFeatureIterator.next(GenomicsDBFeatureIterator.java:181)
        at org.genomicsdb.reader.GenomicsDBFeatureIterator.next(GenomicsDBFeatureIterator.java:49)
        at org.broadinstitute.hellbender.engine.FeatureIntervalIterator.loadNextFeature(FeatureIntervalIterator.java:98)
        at org.broadinstitute.hellbender.engine.FeatureIntervalIterator.loadNextNovelFeature(FeatureIntervalIterator.java:74)
        at org.broadinstitute.hellbender.engine.FeatureIntervalIterator.next(FeatureIntervalIterator.java:62)
        at org.broadinstitute.hellbender.engine.FeatureIntervalIterator.next(FeatureIntervalIterator.java:24)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)  
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) 
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.forEachOrdered(ReferencePipeline.java:490) 
        at org.broadinstitute.hellbender.engine.VariantLocusWalker.traverse(VariantLocusWalker.java:132)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1048)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
        at org.broadinstitute.hellbender.Main.main(Main.java:292)

I changed the heap size to 56 GB and to 128 GB, and the problem persists.

kgururaj commented 4 years ago

It could be that the large number of genotypes being returned by GenomicsDB causes GenotypeGVCFs to consume a very large amount of memory. I think @mlathara is working on a possible fix in #6377. For now, can you try running SelectVariants instead of GenotypeGVCFs on the problem region? This will help determine whether the issue is in GenomicsDB itself or in how the GenomicsDB data is processed by GenotypeGVCFs. I realize this won't solve your problem of running GenotypeGVCFs correctly.
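A region-restricted SelectVariants run of the kind suggested here might look like the following (interval and workspace names reused from the commands earlier in this thread; the output name is illustrative):

```
gatk --java-options "-Xmx16g" SelectVariants \
        -R genome.fasta \
        -V gendb://test_database \
        -L chrom2:4320000-4400000 \
        -O problem_region.chrom2.vcf.gz
```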

marcopessoa commented 4 years ago

I ran the following command:

gatk --java-options "-Xmx96g" SelectVariants -R genome.fasta -V gendb://test_database -O hctest.combinedvariants.chrom2.g.vcf.gz

SelectVariants also failed in the same problem region. It started running, got stuck for about 50 minutes and failed:

14:07:45.004 INFO  ProgressMeter - chrom2:4323705              0.3                  2000           6209.6
14:08:00.372 INFO  ProgressMeter - chrom2:4327258              0.6                  4000           6916.8
14:08:16.027 INFO  ProgressMeter - chrom2:4333144              0.8                  7000           8341.1
14:56:17.993 INFO  SelectVariants - Shutting down engine
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),2722.5394666619995,Cpu time(s),2710.540034284
[January 16, 2020 2:56:22 PM BRT] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 48.97 minutes.
Runtime.totalMemory()=39383990272
Exception in thread "main" java.lang.OutOfMemoryError
        at java.lang.AbstractStringBuilder.hugeCapacity(AbstractStringBuilder.java:161)
        at java.lang.AbstractStringBuilder.newCapacity(AbstractStringBuilder.java:155)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:125)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
        at java.lang.StringBuilder.append(StringBuilder.java:190)
        at htsjdk.tribble.readers.LongLineBufferedReader.readLine(LongLineBufferedReader.java:340)
        at htsjdk.tribble.readers.LongLineBufferedReader.readLine(LongLineBufferedReader.java:356)
        at htsjdk.tribble.readers.SynchronousLineReader.readLine(SynchronousLineReader.java:51)
        at htsjdk.tribble.readers.LineIteratorImpl.advance(LineIteratorImpl.java:24)
        at htsjdk.tribble.readers.LineIteratorImpl.advance(LineIteratorImpl.java:11)
        at htsjdk.samtools.util.AbstractIterator.next(AbstractIterator.java:57)
        at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:70)
        at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:37)
        at org.genomicsdb.reader.GenomicsDBFeatureIterator.next(GenomicsDBFeatureIterator.java:181)
        at org.genomicsdb.reader.GenomicsDBFeatureIterator.next(GenomicsDBFeatureIterator.java:49)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
        at org.broadinstitute.hellbender.engine.VariantWalker.traverse(VariantWalker.java:102)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1048)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
        at org.broadinstitute.hellbender.Main.main(Main.java:292)

It might help to mention that I used GATK 3.8 to rerun HaplotypeCaller -ERC GVCF, CombineGVCFs, and GenotypeGVCFs for this interval on the same samples.

HaplotypeCaller warned that the largest number of PLs found was 2925 (the default value is 100), so I set -max_num_PL_values accordingly. CombineGVCFs worked without issues (it is also worth mentioning that in GATK 4 I only switched to GenomicsDBImport after CombineGVCFs hit an out-of-memory error).

GenotypeGVCFs then warned that, for some active regions in some samples, the maximum allowed number of PLs (set to 3000) was exceeded, with 20475 possible genotypes.

Still, it manages to ignore these regions, genotype the samples, and output a valid VCF file.

I assume these are the regions causing problems in both GenotypeGVCFs and SelectVariants in GATK 4.

Would high sample sequence coverage lead to such issues? Please let me know if you need any information about the experiment.
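(As a cross-check, those PL counts match the VCF genotype-count formula C(A+P-1, P) for ploidy P = 24: a site with 4 alleles gives C(27,24) = 2925 and a site with 5 alleles gives C(28,24) = 20475, so even a handful of alternate alleles at pooled ploidy 24 yields tens of thousands of PL entries per sample.)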

mlathara commented 4 years ago

@marcopessoa, would you mind trying out a potential fix for us? Could you check out the genomicsdb_120_6275_fix branch, build again, and then try your GenotypeGVCFs command line again?

The current defaults for --max-alternate-alleles (6) and --max-genotype-count (1024) were not being passed down to GenomicsDB, and this fix changes that.
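Once those arguments are passed through, a run on the problem region might look something like the sketch below (interval and workspace names reused from earlier in this thread; the flag values shown are the defaults mentioned above):

```
gatk --java-options "-Xmx16g" GenotypeGVCFs \
        -R genome.fasta \
        -V gendb://test_database \
        -L chrom2:4320000-4400000 \
        --sample-ploidy 24 \
        --max-alternate-alleles 6 \
        --max-genotype-count 1024 \
        -O gttest.chrom2.vcf.gz
```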

If the fix doesn't work, and you're willing to share a minimal set of your data that can reproduce the issue, we'll be happy to take a closer look. Thanks!

marcopessoa commented 4 years ago

I rebuilt the new branch and got the same OutOfMemory error with genomicsdb_120_6275_fix.

Should I still see v4.1.4.1-7-gdb054ab-SNAPSHOT for this fix?

Please let me know what data I should provide for you to reproduce this on your side. Thanks a lot for your help.

mlathara commented 4 years ago

You should see v4.1.4.1-8-g828e2bf-SNAPSHOT with the genomicsdb_120_6275_fix.

Try doing a git checkout genomicsdb_120_6275_fix in your git repository and then building again with ./gradlew installDist. Then run the test again.

marcopessoa commented 4 years ago

I finally managed to build the correct snapshot. It ran successfully in the problem region!

It also reported "Sample/Callsets with too many genotypes in combined VCF".

I will keep testing it in other regions, but I can see now that --max-alternate-alleles and --max-genotype-count are being passed to GenomicsDB.

However, the current limit for genotype counts is set to 100000 (I could see that in the code for this PR, in the file GenomicsDBOptions.java; I guess that was hard-coded), and it remained this way even when I set it to 1024 with the --max-genotype-count option.

Not sure if this also fixes the SelectVariants issue, though.

Once again, thanks a lot for your help. You guys are amazing! Can't wait to see this in a new release.

mlathara commented 4 years ago

Great to hear. My mistake on the hard-coded 100000: I was just trying to make that the default, and will get it fixed so that it honors the argument passed in.

SelectVariants will still fail with this branch - we'll have to do something similar for that as well.

marcopessoa commented 4 years ago

Is the --max-genotype-count option fixed in any newer branch (i.e., not hard-coded at 100000)? I am somewhat stuck in my project while GenotypeGVCFs doesn't work, and I would like to set a lower value for this option. Thanks!

nalinigans commented 4 years ago

@marcopessoa, not yet. We will work on this early this week.

nalinigans commented 4 years ago

@marcopessoa, I was hoping to get PR #6305 into the release this week, but it is still in progress. Meanwhile, please use the genomicsdb_120_1 branch, where you can specify --max-genotype-count with the GenotypeGVCFs/GnarlyGenotyper tools and --genomicsdb-max-genotype-count with the SelectVariants tool to be passed on to GenomicsDB. I will post here once there is a release with the changes. Thanks for your patience.
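Based on that description, the corresponding invocations on the genomicsdb_120_1 branch would look roughly like this (workspace and interval names reused from earlier in the thread for illustration):

```
# GenotypeGVCFs / GnarlyGenotyper: cap genotypes with --max-genotype-count
gatk GenotypeGVCFs \
        -R genome.fasta \
        -V gendb://test_database \
        -L chrom2 \
        --max-genotype-count 1024 \
        -O chrom2.genotyped.vcf.gz

# SelectVariants: the equivalent cap is --genomicsdb-max-genotype-count
gatk SelectVariants \
        -R genome.fasta \
        -V gendb://test_database \
        -L chrom2 \
        --genomicsdb-max-genotype-count 1024 \
        -O chrom2.selected.vcf.gz
```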

droazen commented 4 years ago

Closing as resolved