Bismark Deduplication Step

sunshine-lp0 commented 5 years ago

Hi,

Why is deduplication not recommended for RRBS, amplicon, or other target enrichment-type libraries? Thank you.

Sincerely, Anita

FelixKrueger commented 5 years ago

The reason for this is rather pragmatic: whenever you expect a very high coverage of a region, and you do not include unique molecular identifiers (UMIs) in your reads, you cannot tell whether a read was a genuine read from a different cell, or if it was a PCR duplicate.

De-duplication which is based on mapping position (and orientation) alone will only allow 1 alignment to a given position. In cases where you can only ever sequence a small number of different fragments (for RRBS this is ~600,000), you would start discarding reads as soon as each fragment was covered once. In other words: the deeper you sequence, the more data would get discarded.
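To put rough numbers on this, here is a minimal sketch (not part of Bismark) of how many reads would survive position-based de-duplication as depth grows. It assumes single-end reads drawn uniformly from the ~600,000 possible fragments and, importantly, that every read is genuine (no PCR duplicates at all). The function name `expected_retained` and the chosen depths are made up for illustration.

```python
def expected_retained(n_fragments: int, n_reads: int) -> float:
    """Expected number of reads kept by position-only de-duplication.

    Assumption: each read starts at one of `n_fragments` positions with
    equal probability, so the number kept equals the expected number of
    distinct positions hit at least once.
    """
    return n_fragments * (1.0 - (1.0 - 1.0 / n_fragments) ** n_reads)


N_FRAGMENTS = 600_000  # approximate RRBS figure quoted in the comment above

# Illustrative sequencing depths (hypothetical values)
for depth in (100_000, 1_000_000, 10_000_000, 50_000_000):
    kept = expected_retained(N_FRAGMENTS, depth)
    print(f"{depth:>11,} reads -> {kept:>9,.0f} kept ({kept / depth:.1%})")
```

Under these assumptions the number of retained reads can never exceed the number of distinct fragments, so the retained fraction keeps shrinking as depth increases, even though none of the discarded reads were real duplicates.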

The following schematic tries to illustrate this:

[Image: donotdeduplicate]

sunshine-lp0 commented 5 years ago

Thank you very much!

Hendricks27 commented 9 months ago

Hi,

Thank you for asking the question, and for the very nice illustration. This is probably a dumb question: what if we have 2 fragments (read pairs) with different methylation patterns that align to the exact same position? In this diagram, they are considered duplicates. But should they be? Thanks!

FelixKrueger commented 9 months ago

That is a good question indeed. I suppose the answer to that would be: if you have two fragments that align to the very same position but have a distinct methylation pattern, then it depends on whether this difference comes from a sequencing error (= duplicate) or not (non-duplicate). In the latter case, you probably would want to keep both fragments. It is however largely impossible to discriminate between these 2 cases, unless you use UMIs to differentiate duplicates.
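As a rough illustration of why UMIs help, the sketch below compares a position-only key with a position-plus-UMI key. It is a toy example and not Bismark's implementation: the `Alignment` record, its fields, the `dedup` helper, and the example reads are all hypothetical.

```python
from typing import List, NamedTuple


class Alignment(NamedTuple):
    # Hypothetical minimal record; real alignments carry far more information
    chrom: str
    start: int
    strand: str
    umi: str
    meth_call: str  # toy methylation string, e.g. "Z" methylated, "z" unmethylated


def dedup(alignments: List[Alignment], use_umi: bool) -> List[Alignment]:
    """Keep the first alignment seen for each de-duplication key."""
    seen = set()
    kept = []
    for aln in alignments:
        key = (aln.chrom, aln.start, aln.strand)
        if use_umi:
            key += (aln.umi,)  # same position but different UMI -> different molecule
        if key not in seen:
            seen.add(key)
            kept.append(aln)
    return kept


reads = [
    Alignment("chr1", 1000, "+", "AAGT", "ZZz"),
    Alignment("chr1", 1000, "+", "AAGT", "ZZZ"),  # same UMI: treated as a PCR duplicate
    Alignment("chr1", 1000, "+", "CCTA", "zzz"),  # different UMI: treated as a distinct fragment
]

print(len(dedup(reads, use_umi=False)), "kept using position only")
print(len(dedup(reads, use_umi=True)), "kept using position + UMI")
```

In this toy data the second read differs from the first in its methylation call but shares the UMI, so it is treated as a duplicate (for example one carrying a sequencing error), whereas the third read is retained because its UMI marks it as a separate molecule.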

Our guideline for shotgun sequencing for e.g. the human genome is:

you have ~6 billion different positions for a read to start from (top or bottom strand), and a lot more for paired-end sequencing (where both the start and end position can vary). The chances of hitting two genuine reads or read pairs that start and end at the same position in a typical sequencing experiment are normally fairly slim. PCR duplication is a lot more likely, hence -> deduplicate
for anything that uses target enrichment, RRBS, or amplicon sequencing, you do expect a lot more reads aligning to the same position, so -> do not deduplicate unless you are using a UMI-based approach.

Does that make sense?
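The contrast between the two cases can be sketched with a back-of-the-envelope calculation. The snippet below estimates the chance that a given genuine read shares its start position with at least one other genuine read, assuming single-end reads, uniformly distributed start positions, and the position counts quoted in this thread (~6 billion for shotgun, ~600,000 for RRBS). The function name and the read count of 10 million are made up for illustration; real libraries are not uniform, so treat the output as an order-of-magnitude guide only.

```python
import math


def p_shared_start(n_positions: float, n_reads: int) -> float:
    """Probability that at least one of the other (n_reads - 1) genuine reads
    starts at the same position as a given read, assuming uniform starts."""
    # 1 - (1 - 1/P)^(n-1), computed via log1p/expm1 for numerical stability
    return -math.expm1((n_reads - 1) * math.log1p(-1.0 / n_positions))


N_READS = 10_000_000  # hypothetical sequencing depth

for label, positions in (("shotgun (human)", 6e9), ("RRBS", 6e5)):
    p = p_shared_start(positions, N_READS)
    print(f"{label:>16}: P(genuine read shares its start position) = {p:.4f}")
```

With these assumptions, coincident genuine reads are rare in shotgun data, so reads at identical positions are more plausibly PCR duplicates; in RRBS nearly every read shares its position with other genuine reads, so position alone says nothing about duplication.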

Hendricks27 commented 9 months ago

Yes, yes. It makes perfect sense to me. I just want to make sure I understand the problem. Thank you so much for your explanation!

FelixKrueger / Bismark

Bismark Deduplication Step #234