karlkashofer opened this issue 2 years ago

Running umicollapse on 200mio paired-end reads (400 reads total) runs out of Java heap space even with -Xmx96G. Is that normal?
It should not fail with only 400 reads. Have you tried setting -Xms to a larger value? That is the initial heap size. What is the exact command you are running? Paired-end mode takes up more memory, but it shouldn't run out of memory for only 400 reads.
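For example, setting both the initial and maximum heap on the usual bam-mode invocation would look something like this (the jar path, file names, and heap sizes here are just placeholders):

```bash
# Set the initial (-Xms) and maximum (-Xmx) JVM heap explicitly.
# umicollapse.jar, input.bam and dedup.bam are placeholder names.
java -Xms8G -Xmx96G -jar umicollapse.jar bam \
    -i input.bam -o dedup.bam --paired
```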
Sorry, I meant 200mio paired-end reads, which is 400mio reads total.
If you are using paired-end mode (--paired), it takes a lot of memory. This is because it has to make sure pairs of reads stay together during the deduplication process. This involves storing a lot of reads in memory. Potential workarounds could be splitting the 200 million paired-end reads into smaller files and deduplicating them, or not using paired-end mode (but then there might be pairs of reads where only one read of the pair is removed).
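A rough sketch of the splitting workaround could look like the following. This assumes a coordinate-sorted and indexed input BAM, uses samtools (a separate tool, not part of UMICollapse), and the chromosome names and file paths are placeholders:

```bash
# Sketch: deduplicate each chromosome separately, then merge the results.
# Assumes input.bam is coordinate-sorted and indexed.
# Caveat: pairs whose mates map to different chromosomes end up in
# different chunks, so those pairs are not deduplicated together.
for chr in chr1 chr2 chr3; do
    samtools view -b input.bam "$chr" > "$chr.bam"
    samtools index "$chr.bam"
    java -Xmx16G -jar umicollapse.jar bam -i "$chr.bam" -o "$chr.dedup.bam" --paired
done
samtools merge dedup.bam chr*.dedup.bam
```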
Yes, I use --paired as this is Illumina NovaSeq data from Agilent XT libraries (dual index and dual UMI). I don't really understand why --paired needs so much memory. In your paper you state that "the reads at each unique alignment location are independently deduplicated based on the UMI sequences", so I understand it only needs to keep the reads at a single position in memory. I deduplicate WGS data, where there is hardly a position with more than 100 reads, so I really don't understand why it would require > 80GB of memory.
Thanks for your work btw! :)