CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

dedup with `paired` option #581

Closed: YichaoOU closed this issue 1 year ago

YichaoOU commented 1 year ago

Hello,

Since the `--paired` option is slow, I'm wondering what happens if I don't use it.

If reads A and B have the same UMI, but:

  1. A and B have different template lengths (tlen)

  2. A's R1 and B's R1 are different, but A's R2 and B's R2 map to exactly the same location.

Will A and B be collapsed into one read?

Thanks, Yichao

IanSudbery commented 1 year ago

In single-end mode, R2s are always discarded, as single-end BAM files should not contain R2s. If the R1s from A and B are different, then they will not be collapsed.

However, in recent releases, paired mode should not be substantially slower than single-end mode for the majority of datasets. Or, at least, it is not the pairing per se that makes it slower; paired mode might be slower because it leads to more reads being considered independent of each other, and therefore gives a more complex network to deconvolve.
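To make the two cases from the question concrete, here is a minimal toy sketch of position-based grouping. It is not the UMI-tools implementation: the `Read` tuple and `group_reads` function are hypothetical names, UMIs are matched exactly (no error-aware network), and strand and splicing are ignored. It assumes that single-end mode keys reads on UMI plus R1 position only, and that paired mode additionally uses the template length.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical, simplified view of one read pair."""
    name: str
    umi: str
    r1_pos: int  # mapping position of read 1
    tlen: int    # template length of the pair


def group_reads(reads, paired=False):
    """Toy grouping: reads sharing a key are candidates for collapsing.

    Assumption: single-end mode uses (UMI, R1 position); paired mode
    also adds the template length to the key.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read.umi, read.r1_pos)
        if paired:
            key += (read.tlen,)
        groups[key].append(read.name)
    return sorted(groups.values())


reads = [
    Read("A", "ACGT", 100, 250),
    Read("B", "ACGT", 100, 300),  # case 1: same R1 as A, different tlen
    Read("C", "ACGT", 180, 170),  # case 2: different R1, template ends where A's does
]

print(group_reads(reads))               # A and B share a group; C is separate
print(group_reads(reads, paired=True))  # all three reads are separate
```

Under these assumptions, paired mode splits the same reads into more groups than single-end mode, which is consistent with the point above that more reads end up being treated as independent.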

YichaoOU commented 1 year ago

Thank you! It seems this was fixed recently, in 1.1.3 and above (#539)?

I will upgrade.

Thanks, Yichao