al2na / methylKit

R package for DNA methylation analysis
https://bioconductor.org/packages/release/bioc/html/methylKit.html
203 stars 96 forks source link

Filtering by SD after unite, by group; covariate in beta-binomial model #298

Closed desmodus1984 closed 10 months ago

desmodus1984 commented 1 year ago

Hi, I did whole-genome bisulfite sequencing, and called methylation with MethylDackel. I have 3 groups that I want to compare, and I want to use the most robust method to call differentially methylated regions - Betabinomial- model. I read the files, and then filtered by depth and normalize by coverage. Since I am doing pairwise comparison, for each pair-batch, I united them: meth.fil.norm=unite(myobj.filt.norm,min.per.group=3L,mc.cores=6) I read that one can estimate the standard deviation (SD) and then use it to filter reads. I tried this code,

# get percent methylation matrix
pm=percMethylation(meth)

# calculate standard deviation of CpGs
sds=matrixStats::rowSds(pm)

# Visualize the distribution of the per-CpG standard deviation
# to determine a suitable cutoff
hist(sds, breaks = 100)

# keep only CpG with standard deviations larger than 2%
meth <- meth[sds > 2]

and the percentage methylation contains a lot of NA that affect the final calculation of the SD, hence, I couldn't filter using SD.

So, I moved forward without filtering reads by SD. Following, I used the beta-binomial model to estimate differentially methylated regions, using this code: DiffMeth.fil.norm=calculateDiffMethDSS(meth.fil.norm, adjust="BH", mc.cores=12)

and I got only two DMS with a qvalue of 0.01. I talked with a collaborator, and he said that he used methylKit and found way more DMS. He said that he used a covariate, and I found that I can't include a covariate for the beta-binomial model.

Also, I found that during the differential methylation calling, it doesn't make a difference whether one uses 1 core or more. I checked the resource usage and it didn't make a difference to include the mc.cores=12, CPU usage didn't go up beyond 5%, which is the normal CPU usage. FYI, in my machine I have 24 cores.

Thank you very much;

alexg9010 commented 1 year ago

Hi,

Concerning the filtering based on SD the problem seems to occur during the slicing of the meth object, which does not support NA values. You may use meth[which(sds > 2)] to mitigate this.

Unfortunately our current implementation of the beta-binomial model does not support inclusion of covariates and parallel computation has also not been implemented yet (mc.cores argument is only placeholder). If you require the use of covariates for your model, you may have to use the DSS package directly.

Best, Alex

desmodus1984 @.***> schrieb am Di., 5. Sept. 2023, 19:49:

Hi, I did whole-genome bisulfite sequencing, and called methylation with MethylDackel. I have 3 groups that I want to compare, and I want to use the most robust method to call differentially methylated regions - Betabinomial- model. I read the files, and then filtered by depth and normalize by coverage. Since I am doing pairwise comparison, for each pair-batch, I united them: meth.fil.norm=unite(myobj.filt.norm,min.per.group=3L,mc.cores=6) I read that one can estimate the standard deviation (SD) and then use it to filter reads. I tried this code,

get percent methylation matrix

pm=percMethylation(meth)

calculate standard deviation of CpGs

sds=matrixStats::rowSds(pm)

Visualize the distribution of the per-CpG standard deviation

to determine a suitable cutoff

hist(sds, breaks = 100)

keep only CpG with standard deviations larger than 2%

meth <- meth[sds > 2]

and the percentage methylation contains a lot of NA that affect the final calculation of the SD, hence, I couldn't filter using SD.

So, I moved forward without filtering reads by SD. Following, I used the beta-binomial model to estimate differentially methylated regions, using this code: DiffMeth.fil.norm=calculateDiffMethDSS(meth.fil.norm, adjust="BH", mc.cores=12)

and I got only two DMS with a qvalue of 0.01. I talked with a collaborator, and he said that he used methylKit and found way more DMS. He said that he used a covariate, and I found that I can't include a covariate for the beta-binomial model.

Also, I found that during the differential methylation calling, it doesn't make a difference whether one uses 1 core or more. I checked the resource usage and it didn't make a difference to include the mc.cores=12, CPU usage didn't go up beyond 5%, which is the normal CPU usage. FYI, in my machine I have 24 cores.

Thank you very much;

— Reply to this email directly, view it on GitHub https://github.com/al2na/methylKit/issues/298, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADK7JD77H2ALFYGVR62YB43XY5Q4HANCNFSM6AAAAAA4MEMDSY . You are receiving this because you are subscribed to this thread.Message ID: @.***>

desmodus1984 commented 11 months ago

Hi,

Thank you for the solution, but I don't think it will mitigate the problem because the percMethylation matrix from uniting without groups is different from the matrix with uniting with group minimum, and the latter has NAs; so the filtering won't be the same.

In addition, is it possible to get the methylation metrics for a single site/CpG, instead of per-sample? As suggested elsewhere, the logistic model doesn't take into account the variability of the data, and moreover, the issue of my data, is that my samples are hypomethylation, and the beta-binomial seems to be better. Thus, I would check/plot the histogram of meth-metrics per-groups for individual sites/CpG.

Lastly, Is there a way to export the methylation matrix after filtration/normalization? I would like to run some machine learning algorithms on them, to confirm my results.

Thank you very much;

Juan Pablo Aguilar Cabezas

Juan Pablo Aguilar Cabezas

Ecology and Evolutionary Biology Ph.D. Candidate

Department of Biological Sciences

Ohio University, Athens OH


From: Alexander Blume @.> Sent: Wednesday, September 6, 2023 9:25 AM To: al2na/methylKit @.> Cc: Aguilar Cabezas, Juan Pablo @.>; Author @.> Subject: [External] Re: [al2na/methylKit] Filtering by SD after unite, by group; covariate in beta-binomial model (Issue #298)

Use caution with links and attachments.

Hi,

Concerning the filtering based on SD the problem seems to occur during the slicing of the meth object, which does not support NA values. You may use meth[which(sds > 2)] to mitigate this.

Unfortunately our current implementation of the beta-binomial model does not support inclusion of covariates and parallel computation has also not been implemented yet (mc.cores argument is only placeholder). If you require the use of covariates for your model, you may have to use the DSS package directly.

Best, Alex

desmodus1984 @.***> schrieb am Di., 5. Sept. 2023, 19:49:

Hi, I did whole-genome bisulfite sequencing, and called methylation with MethylDackel. I have 3 groups that I want to compare, and I want to use the most robust method to call differentially methylated regions - Betabinomial- model. I read the files, and then filtered by depth and normalize by coverage. Since I am doing pairwise comparison, for each pair-batch, I united them: meth.fil.norm=unite(myobj.filt.norm,min.per.group=3L,mc.cores=6) I read that one can estimate the standard deviation (SD) and then use it to filter reads. I tried this code,

get percent methylation matrix

pm=percMethylation(meth)

calculate standard deviation of CpGs

sds=matrixStats::rowSds(pm)

Visualize the distribution of the per-CpG standard deviation

to determine a suitable cutoff

hist(sds, breaks = 100)

keep only CpG with standard deviations larger than 2%

meth <- meth[sds > 2]

and the percentage methylation contains a lot of NA that affect the final calculation of the SD, hence, I couldn't filter using SD.

So, I moved forward without filtering reads by SD. Following, I used the beta-binomial model to estimate differentially methylated regions, using this code: DiffMeth.fil.norm=calculateDiffMethDSS(meth.fil.norm, adjust="BH", mc.cores=12)

and I got only two DMS with a qvalue of 0.01. I talked with a collaborator, and he said that he used methylKit and found way more DMS. He said that he used a covariate, and I found that I can't include a covariate for the beta-binomial model.

Also, I found that during the differential methylation calling, it doesn't make a difference whether one uses 1 core or more. I checked the resource usage and it didn't make a difference to include the mc.cores=12, CPU usage didn't go up beyond 5%, which is the normal CPU usage. FYI, in my machine I have 24 cores.

Thank you very much;

— Reply to this email directly, view it on GitHub https://github.com/al2na/methylKit/issues/298, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADK7JD77H2ALFYGVR62YB43XY5Q4HANCNFSM6AAAAAA4MEMDSY . You are receiving this because you are subscribed to this thread.Message ID: @.***>

— Reply to this email directly, view it on GitHubhttps://github.com/al2na/methylKit/issues/298#issuecomment-1708476510, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AJWD2VK3AGAKWBIJCU7P4HDXZCBUXANCNFSM6AAAAAA4MEMDSY. You are receiving this because you authored the thread.Message ID: @.***>

alexg9010 commented 11 months ago

Hi @desmodus1984

If you want to get group-wise summaries, you could simply pool samples per group. The function sums up coverage, numCs and numTs values within each group so one representative sample for each group will be created in a new methylBase object.

library(methylKit)
data("methylKit")

## pool per treatment group
mb_p <- pool(methylBase.obj,sample.ids = c("1","0"))
## extract %-methylation matrix
p_pm <- percMethylation(mb_p, rowids = TRUE)
## lineplot of methylation per group
p_pm |>
  head(50) |>
  matplot(type = "l", 
          main = "group-wise methylation signal", 
          ylab = "%-methylation",
          xlab = "CpG-Position")

Alternatively you could also calculate per-group methylation by aggregating mean methylation per group:

library(methylKit)

data("methylKit")

## extract perc methylation matrix (CpG x sample)
pm <- percMethylation(methylBase.obj, rowids = TRUE)
## get group membership
groups <- setNames(getTreatment(methylBase.obj), getSampleID(methylBase.obj))

## aggregate as mean per group (Group x CpG)
pm_aggr_t <- pm |>
  # head() |>
  as.data.frame() |> 
  setNames(as.character(groups)) |> 
  t() |> 
  aggregate(by = list(groups = groups), mean)

## transpose and fix colnames
pm_aggr <- pm_aggr_t |> 
  t() |>
  as.data.frame() |> 
  setNames(pm_aggr_t$groups) |>
  subset(colnames(pm_aggr_t) != "groups") 

## plot as mean per group
pm_aggr |>
  head(50) |>
  matplot(type = "l", 
          main = "group-wise methylation signal", 
          ylab = "%-methylation",
          xlab = "CpG-Position")

Lastly, Is there a way to export the methylation matrix after filtration/normalization?

To export the methylation matrix just save into a text file with write.table.

Best, Alex