PapenfussLab / gridss

GRIDSS: the Genomic Rearrangement IDentification Software Suite

gridss.kraken.ExtractBestViralReference throws java.lang.OutOfMemoryError with SARS-CoV2 #502

Open suhrig opened 3 years ago

suhrig commented 3 years ago

Hi Daniel,

I have a sample where Kraken2 reports a hit allegedly originating from SARS-CoV-2. It's probably a false positive, but that is not the point. The point is that gridss.kraken.ExtractBestViralReference runs out of memory whenever a hit from SARS-CoV-2 is given as input. I assume this is because there are a ton of sequences from this virus in the database, since everyone sequences this virus nowadays, and enumerating all the k-mers from all of these sequences probably needs quite some RAM (and also CPU time).
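As a quick sanity check of the "ton of sequences" hypothesis, one can count how many library sequences a Kraken2 database assigns to SARS-CoV-2 (taxid 2697049). This sketch assumes the standard two-column seqid2taxid.map format (sequence ID, tab, taxid); the toy map created here is only so the snippet is self-contained — point KRAKEN2DB at a real database instead:

```shell
# Count library sequences mapped to a given taxid in a Kraken2 database.
# Toy seqid2taxid.map stands in for a real one (illustrative IDs only).
KRAKEN2DB=$(mktemp -d)
printf 'NC_045512.2\t2697049\nMN908947.3\t2697049\nNC_004718.3\t694009\n' \
    > "$KRAKEN2DB/seqid2taxid.map"

# Sequences assigned directly to SARS-CoV-2 (taxid 2697049)
awk -F'\t' '$2 == 2697049' "$KRAKEN2DB/seqid2taxid.map" | wc -l
```

On a database built from a recent NCBI download, this count would presumably be in the tens of thousands, which would explain both the memory and CPU cost of enumerating k-mers for every candidate reference.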

Here is a minimal test case to reproduce:

Make a file summary_taxa.tsv with the following content:

taxid_genus name_genus  reads_genus_tree    taxid_species   name_species    reads_species_tree  taxid_assigned  name_assigned   reads_assigned_tree reads_assigned_direct
694002  Betacoronavirus 2   694009  Severe acute respiratory syndrome-related coronavirus   2   2697049 Severe acute respiratory syndrome coronavirus 2 2   2

Then, run the following command:

java -Xmx8g -Dpicard.useLegacyParser=false -Dsamjdk.use_async_io_read_samtools=true -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=true -Dsamjdk.buffer_size=4194304 -Dsamjdk.async_io_read_threads=4 -cp /opt/gridss/gridss-2.12.0-gridss-jar-with-dependencies.jar gridss.kraken.ExtractBestViralReference \
                --INPUT_SUMMARY summary_taxa.tsv \
                --INPUT_VIRAL_READS put_any_fastq_file_here_content_does_not_matter.fq \
                --OUTPUT kraken2.fa \
                --OUTPUT_SUMMARY summary_references.tsv \
                --NCBI_NODES_DMP $kraken2db/taxonomy/nodes.dmp \
                --SEQID2TAXID_MAP $kraken2db/seqid2taxid.map \
                --OUTPUT_MATCHING_KMERS kmercounts.tsv \
                $(for fa in $(find $kraken2db/library/viral $kraken2db/library/added -name '*.fna') ; do echo -n "--KRAKEN_REFERENCES $fa "; done)
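The amount of sequence the find loop above feeds into --KRAKEN_REFERENCES can be gauged up front. A minimal sketch, using toy .fna files in place of a real database (replace KRAKEN2DB with the actual path):

```shell
# Rough gauge of total viral reference sequence passed to the tool:
# sum all non-header bases across every .fna file in the library.
# Toy files stand in for a real Kraken2 database (illustrative only).
KRAKEN2DB=$(mktemp -d)
mkdir -p "$KRAKEN2DB/library/viral"
printf '>seq1\nACGTACGT\n' > "$KRAKEN2DB/library/viral/a.fna"
printf '>seq2\nGGGGCCCCAA\n' > "$KRAKEN2DB/library/viral/b.fna"

# Concatenate all FASTA files, drop header lines, count remaining bases
find "$KRAKEN2DB/library/viral" -name '*.fna' -print0 |
    xargs -0 cat | grep -v '^>' | tr -d '\n' | wc -c
```

For the toy data this prints 18; on a database with every deposited SARS-CoV-2 genome it would be many gigabases, consistent with the heap exhaustion below.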

After a few minutes, the command terminates with:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.base/java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:61)
        at java.base/java.nio.ByteBuffer.allocate(ByteBuffer.java:348)
        at htsjdk.samtools.reference.AbstractIndexedFastaSequenceFile.getSubsequenceAt(AbstractIndexedFastaSequenceFile.java:197)
        at htsjdk.samtools.reference.IndexedFastaSequenceFile.getSubsequenceAt(IndexedFastaSequenceFile.java:49)
        at htsjdk.samtools.reference.AbstractIndexedFastaSequenceFile.getSequence(AbstractIndexedFastaSequenceFile.java:162)
        at htsjdk.samtools.reference.IndexedFastaSequenceFile.getSequence(IndexedFastaSequenceFile.java:49)
        at gridss.kraken.ExtractBestViralReference.lambda$null$3(ExtractBestViralReference.java:118)
        at gridss.kraken.ExtractBestViralReference$$Lambda$133/0x000000084013d840.apply(Unknown Source)
        at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
        at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
        at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655)
        at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
        at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
        at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
        at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
        at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
        at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:274)
        at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655)
        at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
        at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
        at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
        at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
        at gridss.kraken.ExtractBestViralReference.doWork(ExtractBestViralReference.java:119)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305)
        at gridss.kraken.ExtractBestViralReference.main(ExtractBestViralReference.java:241)

I had to increase the Java heap space to 50 GB to make it work. It took 4 hours to run, but did succeed in the end.

Regards, Sebastian

d-cameron commented 3 years ago

Thanks for the repro details. I'm unable to reproduce locally, but I suspect that's because NCBI has added many additional SARS-CoV-2 genomes since I built my VIRUSBreakend database. It'll take me some time to rebuild (you may have noticed NCBI downloads are unreliable) before I can reproduce it.

In the meantime, I've made a reference database available that doesn't have this issue: https://melbourne.figshare.com/articles/dataset/virusbreakenddb_20210401_tar_gz/14782299

suhrig commented 3 years ago

Thank you so much for providing a prebuilt database. I can confirm the issue does not happen with your build. I consider this solved and we can close the ticket, unless you want to keep it open to further investigate the issue with a newer build of the database.

d-cameron commented 3 years ago

I'll keep it open until I fix the underlying issue to work with the latest NCBI neighbour genomes.