CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

Limit memory usage of dedup ("Killed") #561

Closed cf-nb closed 1 year ago

cf-nb commented 2 years ago

Hello, when running

umi_tools dedup --stdin=[in-file.bam] --per-contig --per-gene

everything seems to run fine until suddenly the process says "Killed". It is clear that my laptop (16GB RAM) is not sufficient for this, but for example

umi_tools dedup -I [in-file.bam] --output-stats=deduplicated -S deduplicated.bam --extract-umi-method read_id --method unique

finishes just fine without any memory issues.

Is there a way to limit the memory usage of dedup, so that the first process can keep running without being killed? I don't mind it taking a bit longer (the second command finishes in only about 10 minutes for my type of .bam-file).

Thanks!

IanSudbery commented 2 years ago

Hi,

The memory usage of UMI-tools is primarily determined by the number of unique UMIs at an alignment position. Your first command and your second command are very different in this respect: the --per-contig in the first command means that UMI-tools treats all reads aligned to the same contig as having the same alignment position, so all UMIs on a contig are processed together. The second command does not do this; it only considers reads aligned to the same base as sharing an alignment position, and so UMIs from reads with different alignment coordinates are treated separately. If we are talking about the same BAM file in each case, then one of these is definitely wrong and the other correct.
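
To get a feel for how different the two groupings are for your file, you can count reads per contig versus per exact coordinate. This is only a rough sketch (it ignores the strand and soft-clipping adjustments dedup makes, and assumes the in-file.bam name from your commands):

# reads per contig: the group size --per-contig works with (RNAME is SAM field 3)
samtools view [in-file.bam] | cut -f3 | sort | uniq -c | sort -rn | head

# reads per exact alignment coordinate: roughly the group size the default grouping works with
samtools view [in-file.bam] | cut -f3,4 | sort | uniq -c | sort -rn | head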

The other major difference between the two is the method. unique does not perform any error correction on the UMIs, and is therefore much quicker and less memory intensive than the default directional-adjacency method.
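
If the per-contig grouping is the right one for your protocol, one option to reduce memory, at the cost of losing UMI error correction, is to combine it with --method unique. A sketch reusing the flags from your first command (the output name is just a placeholder):

umi_tools dedup --stdin=[in-file.bam] -S [deduplicated.bam] --per-contig --per-gene --method unique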

Finally, --output-stats adds quite substantially to memory consumption and run time, and we tend to recommend not running it on full samples.
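
If you do want the stats, one workaround (not something dedup does for you automatically) is to compute them on a subsample of the BAM and run the full file without --output-stats. A sketch using samtools to take roughly 10% of reads, assuming the BAM is already coordinate-sorted; the subsample file names are placeholders:

# ~10% subsample (seed 42, fraction .1), then index it for dedup
samtools view -b -s 42.1 [in-file.bam] > subsample.bam
samtools index subsample.bam
umi_tools dedup -I subsample.bam --output-stats=deduplicated_stats -S subsample_dedup.bam --extract-umi-method read_id --method unique

Note that the stats will then describe the subsample rather than the full library.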

Unfortunately, most of the memory used by dedup with the default method goes into a single data structure that represents the relationships between the UMI sequences. Because it is a single data structure, we can't really use the disk to reduce its memory footprint (which I think is what usually happens when you trade memory for speed).

Our benchmarking: [benchmark figure: dedup memory usage against the number of unique UMIs at a single position]

This suggests that if you are exceeding 16GB of RAM, you have at least one gene/position with over 16,000 "genuine" UMIs (and many more UMIs that have arisen by error). The example above that used 16GB had a total of 320,000 unique UMIs at a single position. That would be very high for the sorts of technologies that usually precipitate the use of --per-contig; one exception is some gene panel resequencing approaches. I'm afraid in those situations there is not much to be done other than hiring a cloud VM with more memory.
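
If you want to check whether your data looks like that, and assuming your UMI is stored as the last underscore-separated field of the read name (the layout read_id extraction expects), a rough sketch to count unique UMIs per contig:

# unique UMIs per contig; assumes the UMI is the last "_"-separated part of the read name (SAM field 1)
samtools view [in-file.bam] | awk -F'\t' '{n=split($1,a,"_"); print $3 "\t" a[n]}' | sort -u | cut -f1 | sort | uniq -c | sort -rn | head

Contigs whose counts approach the numbers above are the ones likely to be driving memory usage with --per-contig.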