al2na / methylKit

R package for DNA methylation analysis
https://bioconductor.org/packages/release/bioc/html/methylKit.html
210 stars 96 forks source link

It seems sth wrong with calculateDiffMeth() and getMethyDiff() in MethylKit #284

Closed DengEr-1993 closed 1 year ago

DengEr-1993 commented 1 year ago

Hello, @alexg9010

I met a problem about MethylKit recently.

You know there are two key steps like below: 1). myDiff_new = calculateDiffMeth(meth_new, covariates = covariates, overdispersion="MN", test="Chisq", num.cores = 10) 2). myDiff_all = getMethylDiff(myDiff_new, difference = differenceVal, qvalue = 0.1)

And I have several samples datasets and united them together and then use them in steps above.

#######

I used default parameters: myDiff_new = calculateDiffMeth(meth_new, num.cores = 10) and: myDiff_new = calculateDiffMeth(meth_new, covariates = covariates, overdispersion="MN", test="Chisq", num.cores = 10), respectively.

I can reproduce the same results with somebody by the default parameters. But I can not reproduce the same results through the second myDiff_new = calculateDiffMeth(...)

I don't know why and I didn't change any input data or parameters.

I checked this package version. It updated 6 years ago the last time. So it must not because of the version.

I have also thought that I made mistakes but I can reproduce the same result by default parameters.

I even suspected if somebody used different parameters that he didn't tell me ?

I can not contact with him so I want to know if you can give me some advice?

I will appreciate it if you can give me some advice.

It's funny that I am in Charite Mitte where is just near your working place.

Thanks in advance. Xiangyi

alexg9010 commented 1 year ago

Hi @DengEr-1993

I do not fully understand what the problem is that your are reporting here. Are you saying that the following code is generating different results on different machines, while default settings are reproducible?

calculateDiffMeth(meth_new, covariates = covariates, overdispersion="MN", test="Chisq", num.cores = 10)

Could you please add some example output of what is different? Also please check the loaded methylkit versions using the sessioninfo.

Best, Alex

DengEr-1993 commented 1 year ago

Thank you for your advice. And I see the last MethlKit package version in Github is v0.99.2.

But in Bioconductor, it is v1.8.1 which was updated in 2019.

So which one is the latest version and which platform is the main platform, Bioconductor or Github ?

Thank you.

Best, xiangyi

alexg9010 commented 1 year ago

Hi @DengEr-1993,

The release on GitHub is not up-to-date, since the package is generally distributed via Bioconductor. That being said, the most recent version on Bioconductor is actually 1.26.0 (https://bioconductor.org/packages/release/bioc/html/methylKit.html) which should be available on recent versions of R.

Best, Alex

DengEr-1993 commented 1 year ago

Thank you. Got it.

I want to install specific version of methylKit through Bioconductor but I didn't find this version 1.18.0.

So how can I find it and install it ?

Another question is about DMR calling

I tried several versions of methyKit and it meant that I can get the same results by using calculateDiffMeth() and getMethylDiff() functions under the same parameters.

So if I have used the latest version methlKit, in order to keep the same parameters with before, shall I downgrade the version to before. Because I have produced some results with old version methlKit.

Thank you in advance.

Best, xiangyi

DengEr-1993 commented 1 year ago

Sorry @alexg9010 , I forget one important thing:

I used two different parameters to do DMR calling: one is : myDiff_new = calculateDiffMeth(meth_new, num.cores = 10) and then: myDiff_all = getMethylDiff(myDiff_new, difference = 10, qvalue = 0.1)

another one: myDiff_new = calculateDiffMeth(meth_new, covariates = covariates, overdispersion="MN", test="Chisq", num.cores = 10) and then: myDiff_all = getMethylDiff(myDiff_new, difference = 10, qvalue = 0.1)

But the number of dmrids from using the two steps with different parameters are quite different. I saw your tutorila before but I am no sure whether these parameters especially "covariates = covariates, overdispersion="MN"" are necessary ? And can you give me some advice on this point ?

Thank you.

Best, xiangyi

alexg9010 commented 1 year ago

Hi @DengEr-1993,

There might be two ways in to install a specific version of methylKit:

Using Matching Bioc Version

First, figure out the matching version of Bioconductor and then downgrade to the corresponding version (see https://cran.r-project.org/web/packages/BiocManager/vignettes/BiocManager.html#changing-version).

Using Conda

You should be able to install a specific version from conda/bioconda (see https://bioconda.github.io/recipes/bioconductor-methylkit/README.html)

alexg9010 commented 1 year ago

Concerning the question about differential methylation analysis, the function calculateDiffMeth should return roughly the same output if the same arguments are used, independent of the version.

That being said, it is expected that the use of covariates and overdispersion control changes the results compared to the default function call.

In general, I would suggest to use the default parameters unless you have reasons to not do so.

Usually you would include covariates in your analysis if your samples differ by some random variable independent of your treatment vector, like age or concentration of treatment. If covariates are included in the analysis, the function will try to separate the influence of the covariates from the treatment effect via a linear model.

The overdispersion correction you would enable if the data is more variable than expected under the assumed distribution. This can occur due to various reasons, such as technical artifacts, batch effects, or biological heterogeneity. If set to "MN" the basic overdispersion correction, proposed by McCullagh and Nelder (1989) will be applied. This correction applies a scaling parameter to variance estimated by the model.

I hope this helps.

Best, Alex

DengEr-1993 commented 1 year ago

Thank you @alexg9010 . What you said above is vital to my analysis.

Just like your explanation, I used default parameters: myDiff_new = calculateDiffMeth(meth_new, num.cores = 10) and: myDiff_new = calculateDiffMeth(meth_new, covariates = covariates, overdispersion="MN", test="Chisq", num.cores = 10), respectively.

#######################################

Another question is about the second one:

I have a RRBS result from several tissue samples.

And because there are some samples from the same donor, so here I used donor IDs as covariates.

For example:

There are 7 samples: Bm Bm, Bl, Bl, Bl, Bl, Bl and the donor IDs or sample IDs like: Sample_A, Sample_B, Sample_A, Sample_B, Sample_C, Sample_D, Sample_E

the former two belong to Bm, and the latter five belong to Bl. There are some tissues came from the same samples in terms of tissue vs bl comparison.

covariates=data.frame(donorID=c("Sample_A", "Sample_B", "Sample_A", "Sample_B", "Sample_C", "Sample_D", "Sample_E"))

So, I created this information as covariates (like the age covariates in your tutorial, is that ok?).

Besides, in order to minimize the intragroup variance. This correction applies a scaling parameter to variance estimated by the model overdispersion="MN".

Then run it: myDiff_new = calculateDiffMeth(meth_new, covariates = covariates, overdispersion="MN", test="Chisq", num.cores = 10),

So do you think these parameters are reasonable in calculateDiffMeth() ?

And I got less dmrids results from the second one calculation compared to the former one.

I know you said that "In general, I would suggest to use the default parameters unless you have reasons to not do so."

I am looking forward to your response.

Many thanks in advance.

Best, xiangyi.

alexg9010 commented 1 year ago

Hi @DengEr-1993,

I think covariates should be continuous values like age, weight, etc, so I don't think the donor id should be used as a covariate here.

If at all, it could be possible that you would need to account for donor id as a batch effect. One way to check this could be via the PCA (PCASamples()) or clustering (clusterSamples())plots, where you would see obvious clustering by donors instead of treatments. If this is the case you could try to remove batch effect, via methylKit (very rudimentary) or other methods like sva/combat.

Best, Alex

DengEr-1993 commented 1 year ago

Thanks @alexg9010 , I got it.