al2na / methylKit

R package for DNA methylation analysis
https://bioconductor.org/packages/release/bioc/html/methylKit.html
210 stars 96 forks source link

The deep theory of calculateDiffMeth() function of MethylKit #307

Closed DengEr-1993 closed 9 months ago

DengEr-1993 commented 9 months ago

Hello there,

I have a question about calculateDiffMeth() function in package MethylKit that I used it to do differential RRBS analysis but I am not so familiar with its deep theory.

Because I just want to know how does thecalculateDiffMeth() calculate or compare the difference between treatment and control groups.

I referred to the tutorial of methylKit and it said: Depending on the sample size per each set it will either use Fisher’s exact or logistic regression to calculate P-values. P-values will be adjusted to Q-values using SLIM method but I didn't get it in some degree.

For example, I have two groups with two replicates, respectively, treatment and control.

Then, I should convert CpGs to tiles with different window size, such as 500bp or 1000bp with definite filtering conditions.

But when running myDiff=calculateDiffMeth(meth) this step on tiling window I want to know the difference between treatment and control is based on the total CpG sites percentage (sum(freqC)/sum(Coverage)) of methylation level region or on the mean CpG sites percentage (mean(freqC)/mean(Coverage)) of the methylation level region ?

For this question, if not, if there is any other extra calculation theory such as scaling or normalization that may change the original value?

Because I visualized the dmr calling results from MethylKit via pheatmap and then I want to see this results in CpG results via IGV. But the two visualization methods didn't show us quite consistent result.

So could you give me some advice or how to explain this question ?

Thanks a lot.

Best, xiangyi

alexg9010 commented 9 months ago

Hi Xiangyi,

Thanks for reaching out, let me try to clear this up a bit.

You are analysing RRBS data comparing two groups with two replicates. You aggregated counts over tiling windows, which *sums up the methylated/unmethylated base counts over the *tilling windows accross genome (see code https://github.com/al2na/methylKit/blob/388770da40bde7d563224cc5d2259c2a7a6a2e08/R/regionalize.R#L121-L181), so methylation level per region would be the total CpG sites percentage (sum(freqC)/sum(Coverage)). Since you compared two groups with two replicates, the differential analysis based on the tiling windows will by default perform logistic regression based on the methylation percentage for each tile and test for the difference using Chisq-test (see code https://github.com/al2na/methylKit/blob/388770da40bde7d563224cc5d2259c2a7a6a2e08/R/diffMeth.R#L202 ). By default, there is no overdispersion correction enabled, but the calculated p-values are adjusted using the SLIM method.

To me this workflow sounds quite reasonable, but could you elaborate more on the inconsistent results that you observed? Maybe show some screenshots underlining your point?

Best, Alex

Am Mo., 8. Jan. 2024 um 11:34 Uhr schrieb DengEr-1993 < @.***>:

Hello there,

I have a question about calculateDiffMeth() function in package MethylKit that I used it to do differential RRBS analysis but I am not so familiar with its deep theory.

Because I just want to know how does the calculateDiffMeth() calculate or compare the difference between treatment and control groups.

I referred to the tutorial of methylKit and it said: Depending on the sample size per each set it will either use Fisher’s exact or logistic regression to calculate P-values. P-values will be adjusted to Q-values using SLIM method http://www.ncbi.nlm.nih.gov/pubmed/21098430 but I didn't get it in some degree.

For example, I have two groups with two replicates, respectively, treatment and control.

Then, I should convert CpGs to tiles with different window size, such as 500bp or 1000bp with definite filtering conditions.

But when running myDiff=calculateDiffMeth(meth) this step on tiling window I want to know the difference between treatment and control is based on the total CpG sites percentage (sum(freqC)/sum(Coverage)) of methylation level region or on the mean CpG sites percentage (mean(freqC)/mean(Coverage)) of the methylation level region ?

For this question, if not, if there is any other extra calculation theory such as scaling or normalization that may change the original value?

Because I visualized the dmr calling results from MethylKit via pheatmap and then I want to see this results in CpG results via IGV. But the two visualization methods didn't show us quite consistent result.

So could you give me some advice or how to explain this question ?

Thanks a lot.

Best, xiangyi

— Reply to this email directly, view it on GitHub https://github.com/al2na/methylKit/issues/307, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADK7JDZUKZAGGQRSH2D3E2DYNPDSZAVCNFSM6AAAAABBRHGU36VHI2DSMVQWIX3LMV43ASLTON2WKOZSGA3TAMJWGYYDKMA . You are receiving this because you are subscribed to this thread.Message ID: @.***>

DengEr-1993 commented 9 months ago

Hello Alex,

Thank you very much for your response.

Yes, we double checked our result that was the same to our original CpG data and just as you said above, I think I can understand the theory totally.

The reason why I post this question is because we want to see details of CpGs in IGV.

In IGV, we found that CpG sites of zero methylation level wouldn't be showed via bars. Some of us thought the same positions of one replicate sample were not detected by RRBS. So we might believe some of our RRBS data is of poor quality. But now we think our data is ok.

Now it is quite clear. Thanks again.

Have a nice day.

Best, xiangyi

alexg9010 commented 9 months ago

Another idea of how you could explore your results in IGV would be to export your objects as bedgraph files using the bedgraph() function.

You could export the methDiff objects using meth.diff column to visualize methylation difference as done in the original paper.

Usually in IGV, exploring the samples via the bismark bam files you should see coverage and methylation percentage just fine, but you could also create different bedgraph files from the methylRaw/methylRawList objects. Just explore the option of the bedgraph() function.

Best, Alex

DengEr-1993 commented 9 months ago

Another idea of how you could explore your results in IGV would be to export your objects as bedgraph files using the bedgraph() function.

You could export the methDiff objects using meth.diff column to visualize methylation difference as done in the original paper.

Usually in IGV, exploring the samples via the bismark bam files you should see coverage and methylation percentage just fine, but you could also create different bedgraph files from the methylRaw/methylRawList objects. Just explore the option of the bedgraph() function.

Best, Alex

Thanks for your advice. Have a nice day.