CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

--ignore-tlen slow running #557

Closed alexander-e-f-smith closed 1 year ago

alexander-e-f-smith commented 2 years ago

To avoid clipping issues in read2 that can alter the 5' mapping start positions of otherwise duplicated molecules (e.g. see issue #534), I'm looking at the option "--ignore-tlen". However, analysis seems to run drastically slower with this option. Why would this be, and is there a way around it? Thanks once again for your help! Alex

IanSudbery commented 2 years ago

Almost certainly this is because UMI-tools is considering a larger number of reads at once. By default, two reads with the same start position but different tlens are put in separate read bundles. With --ignore-tlen they are put in the same read bundle. As the UMI-tools directional algorithm scales super-linearly with the number of UMIs in the bundle, this is presumably why it's running more slowly.
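As a rough illustration of the bundling effect, here is a toy sketch, not the actual UMI-tools code. The function `bundle_reads`, the `(start, tlen, umi)` tuples and the example reads are all hypothetical and exist only to show how dropping tlen from the grouping key merges bundles.

```python
from collections import defaultdict


def bundle_reads(reads, ignore_tlen=False):
    """Group (start, tlen, umi) tuples into bundles of UMI counts.

    Toy model only: the real tool keys on more than this
    (strand, contig, and so on).
    """
    bundles = defaultdict(lambda: defaultdict(int))
    for start, tlen, umi in reads:
        # Without tlen in the key, reads that share a start position
        # all land in the same bundle.
        key = (start,) if ignore_tlen else (start, tlen)
        bundles[key][umi] += 1
    return {key: dict(umis) for key, umis in bundles.items()}


# Hypothetical reads: same start position, three different tlens.
reads = [
    (100, 150, "AAAA"),
    (100, 150, "AAAT"),
    (100, 180, "CCCC"),
    (100, 180, "AAAA"),
    (100, 210, "GGGG"),
]

with_tlen = bundle_reads(reads)
without_tlen = bundle_reads(reads, ignore_tlen=True)

largest_with = max(len(umis) for umis in with_tlen.values())
largest_without = max(len(umis) for umis in without_tlen.values())

print(len(with_tlen), largest_with)        # number of bundles, largest bundle
print(len(without_tlen), largest_without)
```

In this toy case the three small bundles collapse into one bundle holding every distinct UMI at that start position, which is the situation that makes the later network step more expensive.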

alexander-e-f-smith commented 2 years ago

Thanks Ian. I have a hunch that a certain highly expressed gene in my data (RNAseq type data) could cause this. Does the dedup tool consider position only when initially grouping/bundling, then? Or are these bundles a function of UMI and read1 start (when using --ignore-tlen)? I had also wondered whether you had thought of an option to limit the maximum number of reads in any one read bundle/group?

IanSudbery commented 2 years ago

The purpose of generating a read bundle is to consider all the unique UMIs at a position and decide which ones represent unique reads. Only one read per position/UMI is kept (plus a count of the number of times this read/UMI combo has been encountered), but the bundle contains one read for each unique UMI at a position. The algorithm must then process these, which involves not quite calculating the edit distances between all pairs of UMIs, but work of that order of magnitude, and then constructing the graph of relationships between UMIs. This quickly gets complex if there are a large number (10s-100s of thousands) of UMIs at a given position and/or the ones that are there are densely connected.
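To give a feel for the scaling, here is a naive sketch of the pairwise step. It is deliberately the brute-force version (every pair compared), which the real implementation avoids doing in full. The function names and example counts are hypothetical, Hamming distance stands in for edit distance, and the count rule used for edges (`count_a >= 2 * count_b - 1`) is my reading of the directional method as described in the UMI-tools paper, so treat it as an assumption.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def build_adjacency(umi_counts, threshold=1):
    """Naive all-pairs construction of a directed UMI graph.

    Assumed rule: an edge a -> b exists when the UMIs are within
    `threshold` mismatches and count(a) >= 2 * count(b) - 1.
    """
    adjacency = {umi: [] for umi in umi_counts}
    comparisons = 0
    for a, b in combinations(umi_counts, 2):
        comparisons += 1
        if hamming(a, b) <= threshold:
            if umi_counts[a] >= 2 * umi_counts[b] - 1:
                adjacency[a].append(b)
            if umi_counts[b] >= 2 * umi_counts[a] - 1:
                adjacency[b].append(a)
    return adjacency, comparisons


# Hypothetical bundle: one entry per unique UMI, with its read count.
bundle = {"AAAA": 10, "AAAT": 2, "CCCC": 5, "GGGG": 1}
adjacency, comparisons = build_adjacency(bundle)

print(adjacency)
print(comparisons)  # n * (n - 1) / 2 pairs for n unique UMIs
```

With n unique UMIs in a bundle the brute-force loop makes n(n-1)/2 comparisons, so a bundle of 100,000 UMIs would need roughly 5 billion of them before any graph traversal starts.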