FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0

Bismark Deduplication Step #234

Closed sunshine-lp0 closed 5 years ago

sunshine-lp0 commented 5 years ago

Hi,

Why is deduplication not recommended for RRBS, amplicon, or other target enrichment-type libraries? Thank you.

Sincerely, Anita

FelixKrueger commented 5 years ago

The reason for this is rather pragmatic: whenever you expect very high coverage of a region, and you do not include unique molecular identifiers (UMIs) in your reads, you cannot tell whether a read is a genuine read from a different cell or a PCR duplicate.

De-duplication that is based on mapping position (and orientation) alone will only allow 1 alignment to a given position. In cases where you can only ever sequence a small number of different fragments (for RRBS this is ~600,000), you would start discarding reads as soon as each fragment has been covered once. In other words: the deeper you sequence, the more data gets discarded.

The following schematic tries to illustrate this:

[Image: donotdeduplicate]
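The effect described above can also be sketched numerically. The following is a toy simulation, not Bismark's own code: it assumes reads are drawn uniformly at random from a fixed pool of fragments, that every read comes from a genuinely distinct molecule (no PCR duplicates at all), and it uses a scaled-down pool of 6,000 fragments rather than ~600,000 so that it runs quickly. The function name `simulate_position_dedup` and all parameters are hypothetical.

```python
import random


def simulate_position_dedup(n_fragments, n_reads, seed=0):
    """Toy model of position-only de-duplication.

    Each read is assigned to one of n_fragments possible
    (position, orientation) combinations, here just an integer index.
    Only the first read per fragment is kept; all others are discarded.
    Returns (kept, discarded).
    """
    rng = random.Random(seed)
    seen = set()
    for _ in range(n_reads):
        seen.add(rng.randrange(n_fragments))
    kept = len(seen)
    return kept, n_reads - kept


# Assumption: pool scaled down from ~600,000 to 6,000 fragments.
N_FRAGMENTS = 6_000
results = {}
for fold in (1, 10, 100):
    n_reads = N_FRAGMENTS * fold
    kept, discarded = simulate_position_dedup(N_FRAGMENTS, n_reads)
    results[fold] = (kept, discarded)
    print(f"{fold:>3}x depth: kept {kept:>6}, "
          f"discarded {discarded:>7} ({discarded / n_reads:.1%})")
```

Because no read in this model is a real PCR duplicate, every discarded read is genuine data that was lost, and the discarded fraction grows with sequencing depth while the number of kept reads can never exceed the number of possible fragments.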

sunshine-lp0 commented 5 years ago

Thank you very much!

Hendricks27 commented 9 months ago

Hi,

Thank you for asking the question, and for the very nice illustration. Here is probably a dumb question: what if we have 2 fragments (read pairs) with different methylation patterns that align to the exact same position? In this diagram, they are considered duplicates. But should they be? Thanks!

FelixKrueger commented 9 months ago

That is a good question indeed. I suppose the answer to that would be: if you have two fragments that align to the very same position but have a distinct methylation pattern, then it depends on whether this difference comes from a sequencing error (= duplicate) or not (non-duplicate). In the latter case, you probably would want to keep both fragments. It is, however, largely impossible to discriminate between these 2 cases unless you use UMIs to differentiate duplicates.
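To make the role of UMIs concrete, here is a minimal sketch of how adding the UMI to the de-duplication key changes the outcome. It is an illustration under stated assumptions, not how Bismark implements it: the `Read` record, the `dedup` helper, and the toy UMI and methylation strings are all hypothetical, and real UMI handling usually also has to tolerate sequencing errors within the UMI itself.

```python
from typing import NamedTuple


class Read(NamedTuple):
    chrom: str
    start: int
    strand: str
    umi: str   # hypothetical UMI sequence attached to the read
    meth: str  # toy methylation pattern, for illustration only


def dedup(reads, use_umi):
    """Keep the first read seen for each de-duplication key.

    Key is (chrom, start, strand), plus the UMI when use_umi is True.
    """
    seen = set()
    kept = []
    for r in reads:
        key = (r.chrom, r.start, r.strand)
        if use_umi:
            key += (r.umi,)
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept


reads = [
    Read("chr1", 1000, "+", "ACGT", "Zzz"),
    Read("chr1", 1000, "+", "ACGT", "Zzz"),  # same UMI: treated as a PCR duplicate
    Read("chr1", 1000, "+", "TTAG", "zzZ"),  # different UMI: treated as a distinct molecule
]

by_position = dedup(reads, use_umi=False)
by_umi = dedup(reads, use_umi=True)
print(len(by_position), len(by_umi))  # prints: 1 2
```

With position alone, the read carrying the different methylation pattern is dropped; with the UMI in the key, it survives as a separate molecule while the true duplicate is still removed.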

Our guideline for shotgun sequencing for e.g. the human genome is:

Hendricks27 commented 9 months ago

Yes, yes. It makes perfect sense to me. I just want to make sure I understand the problem. Thank you so much for your explanation!