Hi @retrogenomics - we have actually been thinking a little about whether it's possible to deduplicate without alignment, for similar reasons. We've yet to come up with a suitable solution, because the mapping performs a very useful function: it groups reads that may be duplicates. Without mapping we essentially have to consider all reads at once, which is much more onerous. I believe there may well be a suitable approach using de Bruijn graphs, k-mer hashing, or something along these lines to simplify the representation of reads prior to deduplication, but we haven't had time to play around with this yet.
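To make the k-mer idea a bit more concrete, here is a minimal sketch of one way reads could be pre-grouped without mapping: bucketing by minimizer (the lexicographically smallest k-mer in each read). This is a swapped-in illustration, not something UMI-tools does; the function names and `k=8` are hypothetical choices. Note the obvious weakness: a sequencing error inside the minimizer can send true duplicates to different buckets.

```python
from collections import defaultdict

def minimizer(seq, k=8):
    """Return the lexicographically smallest k-mer in seq (k is an arbitrary choice)."""
    if len(seq) < k:
        raise ValueError("read shorter than k")
    return min(seq[i:i + k] for i in range(len(seq) - k + 1))

def bucket_reads(reads, k=8):
    """Group read names by minimizer so that candidate duplicates share a bucket.

    `reads` is an iterable of (name, sequence) pairs; this layout is an assumption.
    """
    buckets = defaultdict(list)
    for name, seq in reads:
        buckets[minimizer(seq, k)].append(name)
    return dict(buckets)
```

Deduplication would then only need to compare reads within each bucket rather than all reads at once.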
Thanks for your feedback.
I'm wondering whether it would be possible to use UMI-tools to perform deduplication based on read sequence alone. Consider a read with a 5-nt UMI followed by a DNA sequence of biological origin. It should be possible to extract a longer sequence (say 20 nt in total) corresponding to the UMI plus the adjacent target nucleic acid. This 20-nt sequence would be added to the read name, but only the original 5-nt UMI would actually be removed from the read sequence. The reads would then be grouped on the 20-nt sequence (4^20, i.e. >1e+12 theoretical combinations) using the UMI-tools error correction, deduplicated by keeping the read with the highest base quality scores, and finally mapped. I can see a big advantage to this approach: reads that map to multiple genomic coordinates should be handled correctly, which would be of particular interest for small RNAs, transposable elements, and other repeated sequences. A drawback could be the computational resources needed to deduplicate on a 20-nt or longer sequence. I have been thinking of ways to hijack UMI-tools options to achieve this, but I couldn't figure out any straightforward approach.
Do you have any suggestions on how to achieve this with UMI-tools, or any comments on the approach itself?
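To illustrate what I mean, the extraction and exact-match deduplication steps could be sketched roughly as below. This is plain Python, not UMI-tools code: the function names, the `(name, seq, qual)` tuple layout, the underscore separator, and the Phred+33 encoding are all assumptions, and it only collapses identical 20-nt keys (it does not reproduce the UMI-tools error correction).

```python
UMI_LEN = 5    # UMI length at the 5' end of the read (from the example above)
KEY_LEN = 20   # UMI plus the next 15 nt of insert, used as the dedup key

def tag_and_trim(name, seq, qual, umi_len=UMI_LEN, key_len=KEY_LEN):
    """Append the first key_len bases to the read name; trim only the UMI."""
    if len(seq) < key_len:
        return None  # too short to build a key; skipped in this sketch
    key = seq[:key_len]
    return f"{name}_{key}", seq[umi_len:], qual[umi_len:]

def mean_quality(qual, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    return sum(ord(c) - offset for c in qual) / len(qual)

def dedup_exact(reads):
    """For each exact 20-nt key, keep the read with the highest mean quality."""
    best = {}
    for name, seq, qual in reads:
        tagged = tag_and_trim(name, seq, qual)
        if tagged is None:
            continue
        key = tagged[0].rsplit("_", 1)[1]
        if key not in best or mean_quality(tagged[2]) > mean_quality(best[key][2]):
            best[key] = tagged
    return list(best.values())
```

The surviving reads keep their 15 nt of insert after the UMI, so they can still be mapped afterwards, and the key space is 4^20 = 1,099,511,627,776 possible sequences.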