Daniel-Liu-c0deb0t / UMICollapse

Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than a previous tool.

Java heap space #14

Open karlkashofer opened 2 years ago

karlkashofer commented 2 years ago

Running umicollapse on 200mio paired-end reads (400 reads total) runs out of Java heap space even with -Xmx96G. Is that normal?

Daniel-Liu-c0deb0t commented 2 years ago

It should not fail with only 400 reads. Have you tried setting -Xms to a larger value? That is the initial heap size. What is the exact command you are running? Paired-end mode takes up more memory, but it shouldn't run out of memory for only 400 reads.
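For reference, an invocation with both the initial (-Xms) and maximum (-Xmx) heap sizes raised might look like the sketch below. The jar path, subcommand, and file names are placeholders (not taken from this thread), so adjust them to your actual setup:

```sh
# Hypothetical invocation: raise both the initial (-Xms) and maximum (-Xmx) heap size.
# The jar path and input/output file names are placeholders for your actual setup.
java -Xms96G -Xmx96G -jar umicollapse.jar bam -i input.bam -o dedup.bam --paired
```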

karlkashofer commented 2 years ago

Sorry, I meant 200mio paired-end reads, which is 400mio reads total.

Daniel-Liu-c0deb0t commented 2 years ago

If you are using paired-end mode (--paired), it takes a lot of memory. This is because it has to make sure pairs of reads stay together during the deduplication process, which involves keeping a lot of reads in memory. Potential workarounds are splitting the 200 million paired-end reads into smaller files and deduplicating each one separately (see the sketch below), or not using paired-end mode (but then there may be pairs where only one read of the pair is removed).
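One rough sketch of the splitting workaround, assuming a coordinate-sorted and indexed input BAM. The file names and heap size are placeholders, and note the caveat that pairs whose mates map to different chromosomes end up in different chunks:

```sh
# Hypothetical workaround: split the BAM by chromosome, deduplicate each chunk,
# then merge the results. Assumes input.bam is coordinate-sorted and indexed.
# Caveat: read pairs whose mates map to different chromosomes get separated.
for chrom in $(samtools idxstats input.bam | cut -f1 | grep -v '^\*$'); do
    samtools view -b input.bam "$chrom" > "split_${chrom}.bam"
    samtools index "split_${chrom}.bam"
    java -Xmx32G -jar umicollapse.jar bam -i "split_${chrom}.bam" -o "dedup_${chrom}.bam" --paired
done
samtools merge dedup_all.bam dedup_*.bam
```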

karlkashofer commented 2 years ago

Yes, I use --paired, as this is Illumina NovaSeq data from Agilent XT libraries (dual index and dual UMI). I don't really understand why --paired needs so much memory. In your paper you state that "the reads at each unique alignment location are independently deduplicated based on the UMI sequences", so my understanding is that it only needs to keep the reads at a single position in memory. I am deduplicating WGS data, where there is hardly a position with more than 100 reads, so I really don't understand why it would require more than 80 GB of memory.

Thanks for your work, btw! :)
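As an aside, one way to put a number on the "hardly a position with more than 100 reads" observation is to scan per-base coverage with samtools depth. This is only a sanity-check sketch with a placeholder file name, not part of UMICollapse itself:

```sh
# Hypothetical sanity check: report the maximum per-base coverage in the BAM.
# This streams the whole file, so it can take a while on a 400M-read WGS BAM.
samtools depth -a input.bam | awk '$3 > max { max = $3 } END { print "max depth:", max }'
```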