CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
491 stars 190 forks source link

Only softclip of 2bp at start of read distinguishes it from another read with same barcode #534

Closed alexander-e-f-smith closed 8 months ago

alexander-e-f-smith commented 2 years ago

Hi I have noticed that UMIdedup will not treat two reads as molecular duplicates in this situation (informative read info only shown):

read A: VH00216:2:AAALVM3M5:1:2601:70948:9368_TGTAAGAATA 163 chr2 42263189 255 118M1399N13M = 42263290 1763 CATCACCAGCTGAAAAGTCACATAATTCTTGGGAAAATTCAGATGATAGCCGTAATAAATTGTCGAAAATACCTTCAACACCCAAATTAATACCAAAAGTTACCAAAACTGCAGACAAGCATAAAGATGTC

readB: VH00216:2:AAALVM3M5:1:2601:60268:8535_TGTAAGAATA 163 chr2 42263191 255 2S116M1399N13M = 42263290 1761 CTTCACCAGCTGAAAAGTCACATAATTCTTGGGAAAATTCAGATGATAGCCGTAATAAATTGTCGAAAATACCTTCAACACCCAAATTAATACCAAAAGTTACCAAAACTGCAGACAAGCATAAAGATGTC

The difference here being the mismatch and aligner designation of softclipped for the first two bases of read and consequential alignment coordinate difference of 2bp.. (the pairs of each of two these reads are pretty much identical as each other have with the same alignment coordinates/length)

Is this expected and will any options (such as --soft-clip-threshold) be able to correct this. Or do I need to think about the aligner behaviour (star). Thanks for your help once again!!

IanSudbery commented 2 years ago

Hi Alex,

The problem is that these reads are second in pair. Currently we are only able to support accounting for soft-clipping in read1. This is due to how we process reads - we use the position of read1, plus the UMI, plus the size of the insert. This allows us to deduplicate using only read1 (without having seen read2), but leads to this problem. We are currently actively trying to find ways around this, but at the moment, it remains a known problem.

alexander-e-f-smith commented 2 years ago

Thanks Ian, yes that may be a bigger problem than I thought then. Can you possibly confirm then, that if this were the first in the pair (flagged 83 or 99 for example), the soft-clipping would be taken into account and in this situation de-duplication would have occurred on the read pair ...just using the data from the first read in the pair (does this require --soft-clip-threshold to be 2?). If you get a feature branch up and running for fixes relating to the softclipping issue, or any other ability to use read2 when read 1 is defunct in any way (such as unaligned and use of --use-unaligned will use read2 or read 1 rather than just read 1), i'd be interested in testing.

IanSudbery commented 2 years ago

Hi Alex. Indeed, if these had been read1s then they would have been counted as the duplicates.

TomSmithCGAT commented 8 months ago

Closing due to inactivity