From this issue:
The reason for this is rather pragmatic: whenever you expect a very high coverage of a region, and you do not include unique molecular identifiers (UMIs) in your reads, you cannot tell whether a read was a genuine read from a different cell, or if it was a PCR duplicate.
De-duplication which is based on mapping position (and orientation) alone will only allow 1 alignment to a given position. In cases where you can only ever sequence a small number of different fragments (for RRBS this is ~600,000), you would start discarding reads as soon as each fragment was covered once. In other words: the deeper you sequence, the more data would get discarded.
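For reference, the position-based deduplication described above is what Bismark's bundled deduplicate_bismark script does. A minimal sketch, assuming Bismark BAM output (the filename is a placeholder; depending on the Bismark version you may need to add -s or -p for single- or paired-end data):
# Keep only one alignment per chromosome/start position/strand; writes a
# *.deduplicated.bam plus a deduplication report (placeholder filename).
${bismark_dir}/deduplicate_bismark --bam sample_bismark_bt2.bam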
Looks like I should rerun bismark
(or at least rerun a subset without deduplication and compare it to a subset with deduplication). Thoughts @sr320 / @kubu4?
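One way to run that comparison without redoing the alignments (a sketch; the filename and the 10% subsample fraction are placeholders): subsample a BAM with samtools, then run the downstream steps on the subset once with and once without deduplication.
# Reproducible ~10% subsample of one sample's alignments (seed 42) so the
# with/without-deduplication comparison runs quickly (placeholder filename).
${samtools} view -b -s 42.10 sample_bismark_bt2.bam > sample_subset.bam
# Deduplicate the subset; the methylation extractor can then be run on both
# sample_subset.bam and the *.deduplicated.bam this produces.
${bismark_dir}/deduplicate_bismark --bam sample_subset.bam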
This looks like a good paper to explain this type of analysis:
https://www.nature.com/articles/nrg3273
Read the section: Processing enrichment-based data.
Actually, I guess you should probably read the entire thing. But, that section addresses why deduplication should/shouldn't be performed on enriched samples (I think).
I still do not fully follow the logic - to me it would always keep more than one alignment... but sure, go ahead and test.
NOTE: you do not have to rerun the heavy alignment step -
it should be quick to do locally.
# Extract methylation calls and generate bedGraph/coverage output
${bismark_dir}/bismark_methylation_extractor \
--bedGraph --counts --scaffolds \
--multicore 14 \
--buffer_size 75% \
*bam
# Sort files for methylkit and IGV
find *.bam | \
xargs basename -s .bam | \
xargs -I{} ${samtools} \
sort --threads 28 {}.bam \
-o {}.sorted.bam
# Index sorted files for IGV
# The "-@ 28" below specifies the number of CPU threads to use.
find *.sorted.bam | \
xargs basename -s .sorted.bam | \
xargs -I{} ${samtools} \
index -@ 28 {}.sorted.bam
Having read more of that Nature paper, I am feeling confident that deduplication is the way to go. I'm basing this conclusion on the way they distinguish "bisulfite sequencing" and "enrichment-based methods".
Here's info from Box 1 of that paper:
Bisulphite sequencing
Chemical treatment of the DNA with sodium bisulphite gives rise to methylation-specific sequence variants, which can be mapped and quantified by next-generation sequencing. Key advantages of this technology are its comprehensive genomic coverage, high quantitative accuracy and excellent reproducibility. Disadvantages include the high cost of whole-genome resequencing and the difficulty of discriminating between 5-methylcytosine (5mC) and 5-hydroxymethylcytosine (5hmC)144,145. Bisulphite sequencing can be combined with enrichment strategies using restriction enzymes (in reduced-representation bisulphite sequencing (RRBS)146) or DNA fragment capture147,148 to target bisulphite sequencing to a specific fraction of the genome and thereby to reduce the per-sample cost of sequencing.
Enrichment-based methods
DNA-methylation-specific antibodies, methyl-binding domain proteins or restriction enzymes are used to enrich for a fraction of highly methylated (or unmethylated) DNA fragments, and the enrichment of specific fragments is quantified by next-generation sequencing. The two key advantages of enrichment-based methods are the relatively low cost of achieving genome-wide coverage (albeit with a low statistical power in CpG-poor genomic regions64) and the ability to distinguish between different forms of DNA methylation: for example, using antibodies that specifically recognize 5hmC but not 5mC. However, these advantages come at the cost of relatively low resolution and high susceptibility to experimental biases.
My take on this is that there are groups that employ enrichment-based methods but do not perform bisulfite conversion.
Since we do put our samples through bisulfite conversion, our samples fall into the "bisulphite sequencing" description, which is not "enrichment-based".
Bisulphite sequencing can be combined with enrichment strategies using restriction enzymes (in reduced-representation bisulphite sequencing (RRBS)146) or DNA fragment capture147,148 to target bisulphite sequencing to a specific fraction of the genome and thereby to reduce the per-sample cost of sequencing.
That's a pretty accurate description of what we did. I think I'm going to stick with deduplicated samples.
Although...RRBS is still a bisulfite sequencing method, correct? Why would that not need deduplication?
Sounds good
It’s all explained in the paper if you really want to know.
It doesn't have to do with bisulfite sequencing or not, it has to do with the type of enrichment - specifically with a restriction enzyme that cuts the same spot. For RRBS, my sequences will always start with the restriction site, then when I sequence 100bp out from there I am definitely going to get duplicated sequences because they will all line up to those cut sites. I have the same issue when I do RAD-Seq and only sequence from 1 end. I can't deduplicate because all my reads will line up at the same place. If you're doing whole genome or an enrichment protocol where you've randomly sheared across the genome (or if you've done paired-end sequencing for RAD), then you wouldn't expect to get sequences that are EXACTLY the same - unless they are PCR duplicates (at which point, if you include the duplicates, you are biasing your methylation call toward the same fragment that just happened to get amplified lots of times, which is bad).
TLDR - I think you should deduplicate your MBD data since it's randomly fragmented
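A rough way to see this on actual files (a sketch; the filename is a placeholder): count how many alignments share an identical chromosome and start position. A handful of positions with very deep identical pileups points to enzyme-anchored fragments, while a long tail of low counts is what a randomly sheared MBD library should look like.
# Count alignments per (chromosome, start position) and show the deepest
# pileups (placeholder filename; assumes ${samtools} is set as above).
${samtools} view sample_bismark_bt2.bam | \
awk '{print $3"\t"$4}' | \
sort | uniq -c | sort -rn | head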
Got it. Thanks @mgavery !
Thanks! Now I understand!
I deduplicated my files in bismark. @kubu4 pointed out the manual does not recommend deduplication for reduced-representation methods.
I'm not entirely sure why deduplication would be a bad thing for MBD. I tried tagging Mac in this issue because I figured she may have insight, but I couldn't find her username.