RobertsLab / resources

https://robertslab.github.io/resources/

Deduplication for C.v analysis #669

Closed yaaminiv closed 5 years ago

yaaminiv commented 5 years ago

I deduplicated my files in bismark:

The script deduplicate_bismark is supposed to remove alignments to the same position in the genome from the Bismark mapping output (both single and paired-end SAM/BAM files), which can arise by e.g. excessive PCR amplification. Sequences which align to the same genomic position but on different strands are scored individually.
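For reference, the deduplication call would look roughly like this (a hedged sketch: the filename is an assumption, and the `--bam`/`--paired` flags are per the Bismark documentation):

```shell
# Hedged sketch (filename is a placeholder): deduplicate a paired-end
# Bismark BAM. deduplicate_bismark drops alignments sharing chromosome,
# start position, and strand, keeping the first one it sees.
if command -v deduplicate_bismark >/dev/null 2>&1; then
  deduplicate_bismark --bam --paired sample_bismark_bt2_pe.bam
else
  echo "deduplicate_bismark not on PATH; install Bismark first"
fi
```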

@kubu4 pointed out the manual does not recommend deduplication for reduced representation methods:

This step is recommended for whole-genome bisulfite samples, but should not be used for reduced representation libraries such as RRBS, amplicon or target enrichment libraries.

I'm not entirely sure why deduplication would be a bad thing for MBD. I figured Mac may have insight, but I couldn't find her username to tag her in this issue.

yaaminiv commented 5 years ago

From this issue

The reason for this is rather pragmatic: whenever you expect a very high coverage of a region, and you do not include unique molecular identifiers (UMIs) in your reads, you cannot tell whether a read was a genuine read from a different cell, or if it was a PCR duplicate.

De-duplication which is based on mapping position (and orientation) alone will only allow 1 alignment to a given position. In cases where you can only ever sequence a small number of different fragments (for RRBS this is ~600,000), you would start discarding reads as soon as each fragment was covered once. In other words: the deeper you sequence, the more data would get discarded.
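The position-based logic described above can be illustrated with a toy example (not Bismark itself; read names and coordinates are made up). Reads sharing chromosome, start, and strand collapse to a single survivor, while an opposite-strand read at the same position is kept, matching the manual's note that strands are scored individually:

```shell
# Toy position-based deduplication: keep the first read per
# (chromosome, start, strand) key, mirroring what deduplicate_bismark
# does on real alignments. read2 is dropped as a duplicate of read1;
# read3 survives because it maps to the opposite strand.
printf '%s\n' \
  'read1 chr1 100 +' \
  'read2 chr1 100 +' \
  'read3 chr1 100 -' \
  'read4 chr1 250 +' |
awk '!seen[$2, $3, $4]++'
```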

yaaminiv commented 5 years ago

Looks like I should rerun bismark (or at least rerun a subset without deduplication and compare it to a subset with deduplication). Thoughts @sr320 / @kubu4?

kubu4 commented 5 years ago

This looks like a good paper to explain this type of analysis:

https://www.nature.com/articles/nrg3273

Read the section: Processing enrichment-based data.

Actually, I guess you should probably read the entire thing. But, that section addresses why deduplication should/shouldn't be performed on enriched samples (I think).

sr320 commented 5 years ago

I still do not fully follow the logic. To me it seems it would always keep more than one alignment... but sure, go ahead and test.

NOTE: you do not have to rerun the heavy alignment step.

Should be quick to do locally.


${bismark_dir}/bismark_methylation_extractor \
--bedGraph --counts --scaffolds \
--multicore 14 \
--buffer_size 75% \
*bam

# Sort files for methylkit and IGV

find *.bam | \
xargs basename -s .bam | \
xargs -I{} ${samtools} \
sort --threads 28 {}.bam \
-o {}.sorted.bam

# Index sorted files for IGV
# The "-@ 28" below specifies the number of CPU threads to use.

find *.sorted.bam | \
xargs basename -s .sorted.bam | \
xargs -I{} ${samtools} \
index -@ 28 {}.sorted.bam
kubu4 commented 5 years ago

Having read more of that Nature paper, I am feeling confident that deduplication is the way to go. I'm basing this conclusion on the way they distinguish "bisulfite sequencing" and "enrichment-based methods".

Here's info from Box 1 of that paper:

Bisulphite sequencing

Chemical treatment of the DNA with sodium bisulphite gives rise to methylation-specific sequence variants, which can be mapped and quantified by next-generation sequencing. Key advantages of this technology are its comprehensive genomic coverage, high quantitative accuracy and excellent reproducibility. Disadvantages include the high cost of whole-genome resequencing and the difficulty of discriminating between 5-methylcytosine (5mC) and 5-hydroxymethylcytosine (5hmC)144,145. Bisulphite sequencing can be combined with enrichment strategies using restriction enzymes (in reduced-representation bisulphite sequencing (RRBS)146) or DNA fragment capture147,148 to target bisulphite sequencing to a specific fraction of the genome and thereby to reduce the per-sample cost of sequencing.

Enrichment-based methods

DNA-methylation-specific antibodies, methyl-binding domain proteins or restriction enzymes are used to enrich for a fraction of highly methylated (or unmethylated) DNA fragments, and the enrichment of specific fragments is quantified by next-generation sequencing. The two key advantages of enrichment-based methods are the relatively low cost of achieving genome-wide coverage (albeit with a low statistical power in CpG-poor genomic regions64) and the ability to distinguish between different forms of DNA methylation: for example, using antibodies that specifically recognize 5hmC but not 5mC. However, these advantages come at the cost of relatively low resolution and high susceptibility to experimental biases.

My take on this is that there are groups that employ enrichment-based methods, but do not perform bisulfite conversion.

Since we do put our samples through bisulfite conversion, our samples fall under the "Bisulphite sequencing" description, which is not "enrichment-based".

yaaminiv commented 5 years ago

Bisulphite sequencing can be combined with enrichment strategies using restriction enzymes (in reduced-representation bisulphite sequencing (RRBS)146) or DNA fragment capture147,148 to target bisulphite sequencing to a specific fraction of the genome and thereby to reduce the per-sample cost of sequencing.

That's a pretty accurate description of what we did. I think I'm going to stick with deduplicated samples.

yaaminiv commented 5 years ago

Although...RRBS is still a bisulfite sequencing method, correct? Why would that not need deduplication?

sr320 commented 5 years ago

Sounds good


sr320 commented 5 years ago

It’s all explained in the paper if you really want to know.

mgavery commented 5 years ago

It doesn't have to do with bisulfite sequencing or not, it has to do with the type of enrichment - specifically, enrichment with a restriction enzyme that cuts at the same spot. For RRBS, my sequences will always start with the restriction site, so when I sequence 100bp out from there I am definitely going to get duplicated sequences because they will all line up at those cut sites. I have the same issue when I do RAD-Seq and only sequence from 1 end: I can't deduplicate because all my reads will line up at the same place. If you're doing whole genome or an enrichment protocol where you've randomly sheared across the genome (or if you've done paired-end sequencing for RAD), then you wouldn't expect to get sequences that are EXACTLY the same - unless they are PCR duplicates (at which point, if you include the duplicates, you are biasing your methylation call toward the same fragment that just happened to get amplified lots of times, which is bad).

TLDR - I think you should deduplicate your MBD data since it's randomly fragmented
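The point above can be made concrete with toy start coordinates (illustrative numbers only, not real data): RRBS reads all begin at the cut site, so keeping one read per start position throws away genuine reads, while randomly sheared MBD reads mostly have distinct starts and dedup removes only the PCR duplicate:

```shell
# Toy contrast: five read start positions per library type.
rrbs='100 100 100 100 100'   # RRBS: every read starts at the cut site
mbd='100 153 207 311 311'    # MBD: random shearing, one PCR duplicate pair
count_unique() { tr ' ' '\n' | sort -un | wc -l; }
# Position-based dedup keeps one read per unique start:
echo "RRBS survivors: $(echo "$rrbs" | count_unique) of 5"
echo "MBD survivors:  $(echo "$mbd" | count_unique) of 5"
```

For RRBS, 4 of the 5 genuine reads are discarded; for the randomly sheared library, only the duplicate is lost. The deeper you sequence RRBS, the worse the loss gets, which is exactly the manual's warning.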

yaaminiv commented 5 years ago

Got it. Thanks @mgavery !

sr320 commented 5 years ago

Thanks! Now I understand!