CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

Limit memory usage of dedup ("Killed") #561

Closed cf-nb closed 1 year ago

cf-nb commented 2 years ago

Hello, when running

umi_tools dedup --stdin=[in-file.bam] --per-contig --per-gene

everything seems to run fine until suddenly the process says "Killed". It is clear that my laptop (16GB RAM) is not sufficient for this, but for example

umi_tools dedup -I [in-file.bam] --output-stats=deduplicated -S deduplicated.bam --extract-umi-method read_id --method unique

finishes just fine without any memory issues.

Is there a way to limit the memory usage of dedup, so that the first process can keep running without being killed? I don't mind it taking a bit longer (the second command finishes in only about 10 minutes for my type of .bam-file).

Thanks!

IanSudbery commented 2 years ago

Hi,

The memory usage of UMI-tools is primarily determined by the number of unique UMIs at an alignment position. Your first command and your second command are very different in this respect: the --per-contig in the first command means that UMI-tools treats all reads aligned to the same contig as having the same alignment position, so all UMIs on a contig are processed together. The second command does not do this; it only considers reads aligned to the same base as sharing an alignment position, and so UMIs from reads with different alignment coordinates are treated separately. If we are talking about the same BAM file in each case, then one of these is definitely wrong and the other correct.
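
To get a feel for how different the two groupings are for your file, you can count reads per contig versus per exact coordinate. This is only a rough sketch (it ignores the strand and soft-clipping adjustments dedup makes, and assumes the in-file.bam name from your commands):

# reads per contig: the group size --per-contig works with (RNAME is SAM field 3)
samtools view [in-file.bam] | cut -f3 | sort | uniq -c | sort -rn | head

# reads per exact alignment coordinate: roughly the group size the default grouping works with
samtools view [in-file.bam] | cut -f3,4 | sort | uniq -c | sort -rn | head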

The other major difference between the two is the method. unique does not perform any error correction on the UMIs, and is therefore much quicker and less memory intensive than the default directional-adjacency method.
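
If the per-contig grouping is the right one for your protocol, one option to reduce memory, at the cost of losing UMI error correction, is to combine it with --method unique. A sketch reusing the flags from your first command (the output name is just a placeholder):

umi_tools dedup --stdin=[in-file.bam] -S [deduplicated.bam] --per-contig --per-gene --method unique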

Finally, --output-stats adds quite substantially to memory consumption and run time, and we tend to recommend not running it on full samples.
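
If you do want the stats, one workaround (not something dedup does for you automatically) is to compute them on a subsample of the BAM and run the full file without --output-stats. A sketch using samtools to take roughly 10% of reads, assuming the BAM is already coordinate-sorted; the subsample file names are placeholders:

# ~10% subsample (seed 42, fraction .1), then index it for dedup
samtools view -b -s 42.1 [in-file.bam] > subsample.bam
samtools index subsample.bam
umi_tools dedup -I subsample.bam --output-stats=deduplicated_stats -S subsample_dedup.bam --extract-umi-method read_id --method unique

Note that the stats will then describe the subsample rather than the full library.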

Unfortunately, most of the memory used by dedup with the default method goes into a single data structure that represents the relationships between the UMI sequences. Because it is a single data structure, we can't really use the disk to reduce its memory footprint (which I think is what usually happens when you trade memory for speed).

Our benchmarking: [benchmark figure: dedup memory usage against the number of unique UMIs at a single position]

This suggests that if you are exceeding 16GB of RAM, you have at least one gene/position with over 16,000 "genuine" UMIs (and many more UMIs that have arisen by error). The example above that used 16GB had a total of 320,000 unique UMIs at a single position. That would be very high for the sorts of technologies that usually precipitate the use of --per-contig; one exception is some gene panel resequencing approaches. I'm afraid in those situations there is not much to be done other than hiring a cloud VM with more memory.
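
If you want to check whether your data looks like that, and assuming your UMI is stored as the last underscore-separated field of the read name (the layout read_id extraction expects), a rough sketch to count unique UMIs per contig:

# unique UMIs per contig; assumes the UMI is the last "_"-separated part of the read name (SAM field 1)
samtools view [in-file.bam] | awk -F'\t' '{n=split($1,a,"_"); print $3 "\t" a[n]}' | sort -u | cut -f1 | sort | uniq -c | sort -rn | head

Contigs whose counts approach the numbers above are the ones likely to be driving memory usage with --per-contig.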