CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

deduplication without mapping #264

Closed retrogenomics closed 6 years ago

retrogenomics commented 6 years ago

I'm wondering whether UMI-tools could be used to deduplicate reads based on read sequence alone. Consider a read with a 5-nt UMI followed by a DNA sequence of biological origin. It should be possible to extract a longer sequence (say 20 nt in total) corresponding to the UMI plus the start of the target nucleic acid. This 20-nt sequence would be added to the read name, but only the original 5-nt UMI would actually be removed from the read sequence. Reads would then be grouped on the 20-nt sequence (4^20, i.e. >1e+12 theoretical combinations) using the UMI-tools error correction, deduplicated by keeping the read with the highest base quality scores, and only then mapped. I can see a big advantage to this approach: reads that map to multiple genomic coordinates should be handled correctly, which would be of particular interest for small RNAs, transposable elements, and other repeated sequences. A drawback could be the computational resources needed to deduplicate on a 20-nt or longer sequence. I have been thinking of ways to hijack UMI-tools options to achieve this, but I couldn't figure out any straightforward approach.

Do you have any suggestions on how to achieve this with UMI-tools, or any comments on the approach itself?
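To make the idea concrete, here is a minimal sketch in plain Python of the workflow described above. It is not UMI-tools code and does not use any UMI-tools API: the function names, the `(name, seq, qual)` record layout, and the Phred+33 quality encoding are all assumptions made for illustration. It also groups on exact matches of the 20-nt key only, so the error-correction step is left out.

```python
UMI_LEN = 5    # UMI length at the 5' end of the read (from the example above)
KEY_LEN = 20   # UMI + first 15 nt of the insert, used as the deduplication key


def mean_quality(qual, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 is an assumption)."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def dedup_by_sequence(records):
    """Keep one read per distinct 20-nt key (exact match only).

    records: iterable of (name, seq, qual) tuples (hypothetical layout).
    For each key, the read with the highest mean base quality is kept.
    The key is appended to the read name and only the UMI is trimmed off.
    """
    best = {}
    for name, seq, qual in records:
        if len(seq) < KEY_LEN:
            continue  # too short to build a key; skipped in this sketch
        key = seq[:KEY_LEN]
        score = mean_quality(qual)
        if key not in best or score > best[key][0]:
            best[key] = (score, name, seq, qual)

    deduped = []
    for key, (_, name, seq, qual) in best.items():
        # Trim the 5-nt UMI only; the other 15 nt of the key stay in the read.
        deduped.append((f"{name}_{key}", seq[UMI_LEN:], qual[UMI_LEN:]))
    return deduped
```

The deduplicated reads could then be written back to FASTQ and mapped as usual. Exact matching keeps memory proportional to the number of distinct keys, but any sequencing error within the 20 nt would split true duplicates into separate groups.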

TomSmithCGAT commented 6 years ago

Hi @retrogenomics. We have actually been thinking a little about whether it's possible to deduplicate without alignment, for similar reasons. We've yet to come up with a suitable solution. Mapping performs a very useful function by helping us group reads that may be duplicates; without it, we essentially have to consider all reads at once, which becomes much more onerous. I believe there may well be a suitable approach using de Bruijn graphs, k-mer hashing, or something along these lines to simplify the representation of reads prior to deduplication, but we haven't had time to play around with this yet.
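As one illustration of what "hashing" could mean here, the sketch below uses a pigeonhole split to find sequence keys within a small Hamming distance without comparing every key against every other. This is an assumption about one possible direction, not something implemented in UMI-tools or proposed by the maintainers, and `neighbour_pairs` is a hypothetical helper. The idea: if two equal-length keys differ at no more than `max_dist` positions and each key is cut into `max_dist + 1` chunks, at least one chunk must match exactly, so only keys sharing a chunk need to be compared.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def neighbour_pairs(keys, max_dist=1):
    """Return the set of (a, b) pairs, a < b, within max_dist mismatches.

    Assumes all keys have the same length. Keys are bucketed by
    (chunk index, chunk sequence); full comparisons happen only
    inside buckets rather than across the whole data set.
    """
    keys = sorted(set(keys))
    if not keys:
        return set()
    n_chunks = max_dist + 1
    length = len(keys[0])
    bounds = [i * length // n_chunks for i in range(n_chunks + 1)]

    buckets = defaultdict(list)
    for key in keys:
        for i in range(n_chunks):
            buckets[(i, key[bounds[i]:bounds[i + 1]])].append(key)

    pairs = set()
    for members in buckets.values():
        for a, b in combinations(members, 2):
            if hamming(a, b) <= max_dist:
                pairs.add((a, b))
    return pairs
```

Highly abundant chunks (for example from a very common small RNA) would still produce large buckets, so this reduces the number of comparisons in typical cases rather than bounding it.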

retrogenomics commented 6 years ago

Thanks for your feedback.