CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

Memory issue #474

Closed: sameerh closed this issue 1 year ago

sameerh commented 3 years ago

Dear Developers, I am running umi_tools dedup on a merged BAM file:

umi_tools dedup -I Merged_sorted.bam -S dedup_Merged_sorted.bam --method=unique --extract-umi-method=tag --umi-tag=UB --cell-tag=CB

I get an out-of-memory error when running this command:

slurmstepd: error: Detected 1 oom-kill event(s) in step 4783287.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

Any suggestion to solve this issue will be appreciated.

Thanks in advance!

Sameer

IanSudbery commented 3 years ago

What do you mean by "merged bamfile"? I see that you have specified a cell-tag. If this is single cell sequencing, then you almost certainly need to specify --per-cell, otherwise it will treat all reads as coming from the same cell, and will try to hold everything in memory at once. Many single cell techniques also require --per-gene, depending on whether fragmentation happens before or after PCR amplification.
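For illustration, the adjusted command might look like the following. This is only a sketch built from the flags you already quoted; whether you also need --per-gene (and which gene tag your BAM carries) depends on your protocol:

umi_tools dedup -I Merged_sorted.bam -S dedup_Merged_sorted.bam --method=unique --extract-umi-method=tag --umi-tag=UB --cell-tag=CB --per-cell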

But even so, with --method=unique you must have a very large number of UMIs at the same location (or a very small amount of memory) to be running out of memory, as --method=unique does very little that requires much memory.

sameerh commented 3 years ago

Thanks IanSudbery! Yes, this is single cell sequencing data. Here, I am not using the data for downstream single cell analysis; rather, I am using it to analyse the peaks for the genes of our interest. However, I later realised that instead of merging all the BAM files, it might be better to ask: is it possible to provide multiple input files to umitools?

Thanks!

Sameer

IanSudbery commented 3 years ago

It is not possible to provide multiple input files to umitools, but even if you could, the output would be unlikely to be valid. UMI-tools in unique mode assumes that two reads with the same UMI mapping to the same location are duplicates of each other. However, two reads could have the same UMI and map to the same location, but if they came from different cells, they could not be PCR duplicates. Thus, you really need to be running with --per-cell. However, this is going to increase memory usage, not decrease it, as UMI-tools will have to keep a read in memory for each UMI for each cell, whereas at the moment it only stores one read per UMI.
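As a rough way to gauge how much extra state per-cell deduplication implies, you could count the distinct (cell barcode, UMI) pairs in the BAM. This is only a sketch, assuming every read carries CB:Z: and UB:Z: tags and that samtools and awk are available:

# count distinct (cell barcode, UMI) pairs; optional tags start at field 12 of a SAM record
samtools view Merged_sorted.bam \
  | awk '{cb=""; ub=""; for(i=12;i<=NF;i++){if($i ~ /^CB:Z:/) cb=$i; if($i ~ /^UB:Z:/) ub=$i}; if(cb!="" && ub!="") print cb, ub}' \
  | sort -u | wc -l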

Do you know what the sequencing depth of your sample is and how much memory is available? I have never seen a dedup process run out of memory when --method=unique is specified.
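If it helps in answering that: assuming samtools is installed, samtools flagstat Merged_sorted.bam will give the total read count as a proxy for depth, and on SLURM the peak memory of the failed step can usually be recovered with sacct, for example:

sacct -j 4783287 --format=JobID,MaxRSS,ReqMem

(the job ID here is taken from the oom-kill message you posted).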