Why use seqkit rmdup to remove UMI duplicates?

Zhe-jiang / PRAISE

Bioinformatics guide and scripts for PRAISE, a quantitative pseudouridine sequencing method


Why use seqkit rmdup to remove UMI duplicates? #2

Open xiaohe0404 opened 1 year ago

xiaohe0404 commented 1 year ago

Sorry to bother you. The Methods section of the paper states that “We then used Seqkit to deduplicate based on 8 bp unique molecular identifier (UMI) at the 5′ end of reads R2, key process parameters are as follows: seqkit rmdup -s.”. However, seqkit rmdup -s compares the full read sequence, not only the 8 bp UMI, so the UMI information seems redundant in this step. I'm therefore confused about how the UMI and seqkit rmdup are combined to remove duplicates. Looking forward to your reply! Best wishes

xiaohe0404 commented 1 year ago

By the way, could you upload the detailed scripts, such as site_merge.py, to GitHub? I would really, really appreciate it!!

Zhe-jiang commented 1 year ago

Thank you for using our method PRAISE.

Regarding your question about using seqkit to deduplicate: it is true that seqkit considers the entire sequence. Because the UMI is part of the read, seqkit will therefore collapse only reads that have both the same insert sequence and the same UMI, which is what we consider "PCR duplication".
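To illustrate the point, here is a minimal Python sketch of deduplication by full read sequence. It is not seqkit's implementation: it only does exact string matching and does not reproduce seqkit's options (case or strand handling, for example). The read names, UMIs, and insert sequences are made up, and it assumes the 8 nt UMI is still attached to the 5′ end of the read when duplicates are removed.

```python
def rmdup_by_seq(reads):
    """Keep the first record seen for each distinct full read sequence."""
    seen = set()
    kept = []
    for name, seq in reads:
        if seq not in seen:
            seen.add(seq)
            kept.append((name, seq))
    return kept


# Hypothetical inserts and UMIs, for illustration only.
INSERT_A = "ACGTTGCAAGGCTTAC"
INSERT_B = "TTGACCGTAAGGCATC"

# Each read is the 8 nt UMI followed by the insert (assumed layout).
reads = [
    ("r1", "AAAACCCC" + INSERT_A),
    ("r2", "AAAACCCC" + INSERT_A),  # same UMI, same insert: PCR duplicate
    ("r3", "GGGGTTTT" + INSERT_A),  # different UMI, same insert: kept
    ("r4", "AAAACCCC" + INSERT_B),  # same UMI, different insert: kept
]

kept = rmdup_by_seq(reads)
print([name for name, _ in kept])  # prints ['r1', 'r3', 'r4']
```

Only r2 is dropped. r3 shares its insert with r1 but carries a different UMI, so the full sequences differ and it survives. In that sense the UMI is not redundant: without it, r1 and r3 would be indistinguishable.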

I am also planning to add more scripts to the 'Call signal' section sometime later. Thank you for your patience.

xiaohe0404 commented 1 year ago

Thanks for your reply! I sincerely look forward to your update!