"Strange" CpG methylation distribution --- barely any non-methylated read

franrodalg commented 3 years ago

Hi @FelixKrueger !!

A while ago you recommended us a trimming method that worked wonders for our library prep. Some of our collaborators have followed the same procedure, and they get proper alignment rates and I think bisulfite conversion also looks good. Nevertheless, they feel the CpG methylation is suspiciously high:

Final Alignment report

======================

Sequence pairs analysed in total:       147040897
Number of paired-end alignments with a unique best hit: 109536212
Mapping efficiency:     74.5%
Sequence pairs with no alignments under any condition:  28176414
Sequence pairs did not map uniquely:    9328271
Sequence pairs which were discarded because genomic sequence could not be extracted:    153

Number of sequence pairs with unique best (first) alignment came from the bowtie output:

CT/GA/CT:       54936016        ((converted) top strand)
GA/CT/CT:       0       (complementary to (converted) top strand)
GA/CT/GA:       0       (complementary to (converted) bottom strand)
CT/GA/GA:       54600043        ((converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total:     0 

Final Cytosine Methylation Report

=================================

Total number of C's analysed:   5284524534

Total methylated C's in CpG context:    196109804
Total methylated C's in CHG context:    6840567
Total methylated C's in CHH context:    25136941
Total methylated C's in Unknown context:        214460

Total unmethylated C's in CpG context:  42386804
Total unmethylated C's in CHG context:  1082311243
Total unmethylated C's in CHH context:  3931739175
Total unmethylated C's in Unknown context:      1499757

C methylated in CpG context:    82.2%
C methylated in CHG context:    0.6%
C methylated in CHH context:    0.6%
C methylated in unknown context (CN or CHN):    12.5%

Bismark completed in 1d 0h 44m 3s

but the most suspicious part is the CpG density plot, since there seems to be virtually no lowly methylated reads.

meth_density

Does this look normal to you, or do you think there has been an issue with the library or its processing?

Cheers, Fran

FelixKrueger commented 3 years ago

Hi Fran,

I don't know which organism you are dealing with, but in case this is a differentiated human sample I don't think that the CpG methylation is suspiciously high, but is seems to be within the realm of 'normal' average methylation levels.

If you have an average methylation of 82%, then naturally a lot of the density curve would be expected in the 75-100% part of the plot, and not very much in the low methylation part. I think the picture would look different if you would specifically look in lowly methylated regions such as CpG islands. Also, since your plot says "whole genome >10 reads" I would be worried that this depth filtering could potentially have some impact as well. Does the plot look similar if you don't apply depth filtering? The reason I am asking is that 110 million aligned reads are unlikely to be in the region of a a 10-fold coverage (1-4X seems more likely), so applying this kind of depth coverage might unduly select for regions that show a different methylation than the rest of the genome.

But again, since your overall CpG methylation is so high I don't think that coverage depth thresholding is responsible for much of effect here. Chances are that it is all as expected.

pplaw commented 3 years ago

Hi Felix,

Thank you so much for your response! We are working on human blood (buffy coat).

And when read counts filter is removed, we got the following density curve:

And I have also tried to remove the peak at 100%me to have a closer look at the other peaks:

We have then however generated M-bias plot for this sample for further QC:

CpG R1:

CpG R2:

Does these look to you that we are having some problems with R2 reads and they need to be trimmed further?

Thanks a lot!

Cheers, Pui

FelixKrueger commented 3 years ago

Hi Pui,

That looks reassuring. The other peaks you are seeing seem to correspond to positions with a defined number of C's per position so 50% (1:1 methylated/unmethylated), 33% or 66% (1:2 or 2:1), 25% and 75% (1:3 or 3:1), again with a tendency towards higher methylation. Such 'peaks' would be filtered out with a 10 read threshold.

The M-bias plots look amazingly flat throughout at the same level in R1 and R2. The very last position of R2 always appears as 100% methylated, simply because the Illumina adapter starts with AGATC..., meaning R2 may never end in A (which means it may not be found unmethylated). This is consequence of adapter trimming, and is nothing you do about it really. If you look at the number of times this happens (just mouse over that position in the Bismark or MultiQC HTML report), the number is usually orders of magnitude lower (mostly because of the overlap detection and removal), and thus completely negligible.

I remain firm: your data looks great, go and get some nice results! :)

franrodalg commented 3 years ago

Thanks so so much Felix. I think they feel much more confident at the moment that the results are believable. Your insights are always invaluable.

pplaw commented 3 years ago

Thanks a lot for your help Felix!

FelixKrueger commented 3 years ago

I'm glad it helped.

FelixKrueger / Bismark

"Strange" CpG methylation distribution --- barely any non-methylated read #406