FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0

methylation % calculation for targeted region #354

prasundutta87 closed this issue 4 years ago

prasundutta87 commented 4 years ago

Hi, I want to get the average methylation % for a targeted region, for example, see below...

Chrom start end Meth_% count_methylated count_unmethylated
chr1 1371409 1371409 0 0 5
chr1 1371539 1371539 0 0 88
chr1 1371567 1371567 2.32 2 84
chr1 1371572 1371572 2.3 2 85
chr1 1371575 1371575 1.12 1 88

If my targeted region of interest is chr1:1371409-1371575 and each row corresponds to a covered cytosine (from the 1-based bedGraph-style coverage file produced by Bismark), how should I proceed with my calculation?

1) Should I just take the mean of the Meth_% values per locus, i.e. (0 + 0 + 2.32 + 2.30 + 1.12)/5 ≈ 1.148%?

OR

2) total count_methylated / (total count_methylated + total count_unmethylated), i.e. (2+2+1)/(2+2+1+5+88+84+85+88) = 5/355 ≈ 0.014, or 1.4%?

This one is more like a weighted mean (correct me if I am wrong).

OR

Is there some other correct method for this kind of targeted, region-based calculation?
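For concreteness, here is a minimal Python sketch of both options, with the five rows above hard-coded (nothing here is Bismark-specific):

```python
# The five positions from the table above as (count_methylated, count_unmethylated).
rows = [(0, 5), (0, 88), (2, 84), (2, 85), (1, 88)]

# Option 1: unweighted mean of the per-position methylation percentages.
per_position = [100.0 * m / (m + u) for m, u in rows]
option1 = sum(per_position) / len(per_position)

# Option 2: pool all calls first, then take a single ratio (a coverage-weighted mean).
option2 = 100.0 * sum(m for m, _ in rows) / sum(m + u for m, u in rows)

print(f"option 1: {option1:.3f}%")  # 1.150% (the 1.148% above used rounded inputs)
print(f"option 2: {option2:.3f}%")  # 1.408%
```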

Regards, Prasun

FelixKrueger commented 4 years ago

Hi Prasun,

This is a very good question, and your suggestions are very thoughtful. You do indeed have several options here, and I don't think there is a single right, or even 'correct', way of doing this. There are simply several options, and each has its own advantages and disadvantages.

We personally lean towards option 1 (this is also how it is implemented in the bisulfite quantitation pipeline within SeqMonk), because averaging the percentages of several individual CpGs takes out extreme coverage outliers. Say one residue, for whatever reason, gets 10,000 methylation calls. The signal coming from this position (most likely some kind of amplification or mapping artefact) would completely swamp the % methylation calculation under method 2. With averaging, such a position still only gets 1/5th of a say under method 1.
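To put numbers on that, a quick sketch with a hypothetical artefact position (10,000 calls at 90% methylation) next to four ordinary positions at 10%:

```python
# Four well-behaved CpGs at 10% methylation plus one hypothetical artefact
# position with 10,000 calls at 90% methylation: (count_meth, count_unmeth).
rows = [(1, 9), (1, 9), (1, 9), (1, 9), (9000, 1000)]

method1 = sum(100.0 * m / (m + u) for m, u in rows) / len(rows)
method2 = 100.0 * sum(m for m, _ in rows) / sum(m + u for m, u in rows)

print(f"method 1: {method1:.1f}%")  # 26.0%: the artefact gets only 1/5 of a say
print(f"method 2: {method2:.1f}%")  # 89.7%: the artefact swamps the pooled ratio
```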

While in that scenario method 1 would probably give the fairer result, your data shows the opposite situation: position 1371409 has only 5 calls, while the other four positions have around 90 calls each. Under method 1 every position gets the same say, but you might argue that you shouldn't trust the signal at position 1371409 as much, since it wasn't covered to the same extent.

Both methods are valid and can be defended; I don't think one is necessarily right and the other wrong.

What we do is require that the regions we are measuring are measured well enough, e.g. by designing probes over 100 consecutive CpGs (or some other reasonably large measurement area, such as CGIs, promoter regions etc.) and requiring that at least, say, 20 different positions within that window of measurement were covered. If we can afford it depth-wise, we would also apply a coverage depth filter, e.g. 5 or 10 reads. In your case you seem to have some positions with plenty of coverage, so you might be able to apply such a depth filter to remove Cs you don't trust.
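As an illustration, here is a sketch of such a filter over a Bismark coverage file; the thresholds are examples and the helper itself is hypothetical, not part of Bismark, but the column layout matches the table above:

```python
# Example thresholds, not Bismark defaults: require >= 5 calls per cytosine and
# >= 20 covered positions per region before reporting a region-level value.
MIN_DEPTH = 5
MIN_POSITIONS = 20

def region_methylation(coverage_file, chrom, start, end):
    """Mean per-position methylation % over chrom:start-end (1-based, inclusive).

    Expects Bismark coverage-file columns: chromosome, start, end,
    methylation %, count methylated, count unmethylated.
    """
    percentages = []
    with open(coverage_file) as fh:
        for line in fh:
            c, pos, _end, _pct, n_meth, n_unmeth = line.rstrip("\n").split("\t")
            depth = int(n_meth) + int(n_unmeth)
            if c == chrom and start <= int(pos) <= end and depth >= MIN_DEPTH:
                percentages.append(100.0 * int(n_meth) / depth)
    if len(percentages) < MIN_POSITIONS:
        return None  # region not measured well enough; skip it
    return sum(percentages) / len(percentages)

# e.g. region_methylation("sample.bismark.cov", "chr1", 1371409, 1371575)
```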

While this already tends to give us fairly reliable results, we would still recommend that if you find interesting regions, you go back to the original data and convince yourself that the region looks credible and trustworthy, and that there is no artefact you hadn't foreseen.

Sorry if this answer isn't very concise, but we believe there is more than one way to approach this...

prasundutta87 commented 4 years ago

This is really helpful, Felix. The reasons you have mentioned have indeed made things clear in my head. Applying a depth-wise coverage filter would indeed be fine, because, as you correctly mentioned, most CpGs are covered by plenty of reads, with only a few exceptions (we performed a target enrichment protocol, and I have deduplicated using UMI barcodes). And yes, going back and looking at the original data is great advice; I will definitely do that as a general QC.

FelixKrueger commented 4 years ago

Excellent, good luck!