The reason is rather pragmatic: whenever you expect very high coverage of a region and your reads do not include unique molecular identifiers (UMIs), you cannot tell whether a read at a given position is a genuine read from a different cell or a PCR duplicate.
De-duplication based on mapping position (and orientation) alone allows only 1 alignment per position. In cases where you can only ever sequence a small number of different fragments (for RRBS this is ~600,000), you would start discarding reads as soon as each fragment was covered once. In other words: the deeper you sequence, the more data gets discarded.
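To make the "deeper sequencing, more discarded data" point concrete, here is a small simulation sketch. It is not how any particular tool works: the function name `positional_dedup_stats`, the uniform sampling of fragments, and the toy library size of 1,000 fragments (instead of ~600,000) are all assumptions chosen for illustration.

```python
import random

def positional_dedup_stats(n_fragments, n_reads, seed=0):
    """Simulate reads drawn uniformly from a fixed set of fragments and
    de-duplicate them by (position, strand) alone.

    Hypothetical helper for illustration only; real libraries are not
    sampled uniformly.
    """
    rng = random.Random(seed)
    # Each read is represented only by its mapping key: (start, strand).
    reads = [(rng.randrange(n_fragments), "+") for _ in range(n_reads)]
    # Position-based de-duplication keeps one alignment per key.
    kept = len(set(reads))
    discarded = n_reads - kept
    return kept, discarded

# Toy library with 1,000 possible fragments, sequenced at increasing depth.
for depth in (1_000, 10_000, 100_000):
    kept, discarded = positional_dedup_stats(n_fragments=1_000, n_reads=depth)
    print(f"{depth:>7} reads: kept {kept:>5}, discarded {discarded / depth:.1%}")
```

The number of reads kept can never exceed the number of distinct fragments, so under these assumptions the discarded fraction grows towards 100% as depth increases.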
The following schematic tries to illustrate this:
Thank you very much!
Hi,
Thank you for asking the question, and for the very nice illustration. Here is a probably dumb question: what if we have 2 fragments (read pairs) with different methylation patterns that align to the exact same position? In this diagram they are considered duplicates, but should they be? Thanks!
That is a good question indeed. I suppose the answer to that would be: if you have two fragments that align to the very same position but have a distinct methylation pattern, then it depends on whether this difference comes from a sequencing error (= duplicate) or not (non-duplicate). In the latter case, you probably would want to keep both fragments. It is however largely impossible to discriminate between these 2 cases, unless you use UMIs to differentiate duplicates.
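As a rough sketch of why UMIs help here, the following toy example compares position-only de-duplication with a UMI-aware variant. The `Read` record, the `dedup` function, and the methylation call strings are hypothetical and only meant to illustrate the idea; they do not reproduce any specific tool's implementation.

```python
from collections import namedtuple

# Hypothetical minimal read record; meth_calls is a toy methylation string.
Read = namedtuple("Read", "chrom start strand umi meth_calls")

def dedup(reads, use_umi):
    """Keep the first read seen for each key.

    The key is (chrom, start, strand), optionally extended by the UMI.
    """
    seen = set()
    kept = []
    for r in reads:
        key = (r.chrom, r.start, r.strand)
        if use_umi:
            key += (r.umi,)
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept

reads = [
    Read("chr1", 1000, "+", "ACGT", "ZzZ"),
    # Same position and same UMI: most likely a PCR duplicate, and the
    # differing call would then be a sequencing error.
    Read("chr1", 1000, "+", "ACGT", "Zzz"),
    # Same position but a different UMI: a distinct original molecule.
    Read("chr1", 1000, "+", "TTAG", "zzz"),
]

print(len(dedup(reads, use_umi=False)))  # position only
print(len(dedup(reads, use_umi=True)))   # position + UMI
```

With position alone, all three reads collapse into one; with the UMI in the key, the read from the second molecule survives even though it maps to the same position.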
Our guideline for shotgun sequencing for e.g. the human genome is:
Yes, yes. It makes perfect sense to me. I just want to make sure I understand the problem. Thank you so much for your explanation!
Hi,
Why is deduplication not recommended for RRBS, amplicon, or other target enrichment-type libraries? Thank you.
Sincerely, Anita