al2na / methylKit

R package for DNA methylation analysis
https://bioconductor.org/packages/release/bioc/html/methylKit.html
204 stars 96 forks source link

Suggestion: predicted effect size corrected for unbalanced covariates #224

Open therealgenna opened 3 years ago

therealgenna commented 3 years ago

Hi,

When methylation difference is calculated there are 3 options for effect size calculation: weighted mean, unweighted mean and model prediction (predicted). All of these options calculate or estimate effect size (i.e., methylation difference) which is due both to the group (e.g., treatment vs. control) as well as due to covariates, when they are present/used, and if they are unbalanced (i.e., different) across the treatment groups.

I suggest an additional simple calculation for an average effect size prediction, corrected for covariates (named predicted2 below). Basically, all covariate sets found in any of the treatment groups will be used in all groups in making model predictions. The rest is the same. In that case the predicted effect will not be due to difference in covariates, will be based on covariate sets used in the data, and will agree with reported statistical significance for group effect. This is also what might be of interest to the user (as a note, it might make sense to return all effect size calculations together, not just one, so that there is no need to rerun the function to get a different version of effect size estimate.)

Here is the code I suggest to be added to the logReg function (starting with predicted2 below, as an additional possibility in the switch statement:

  meths=switch(effect,
               wmean={
                 #ms=tapply(y, as.factor(vars$treatment), sum )
                 ms=tapply(y,treatment,sum)
                 #ws=tapply(w, as.factor(vars$treatment), sum )
                 ws=tapply(w,treatment,sum)
                 ms/ws  
               },
               mean={
                 tapply(prop, treatment, FUN = mean)
               },
               predicted={
                 tapply(mu, treatment, FUN = mean)
               },
               predicted2={
                 # Add effect estimate calculated on balanced set of covariates.
                 # Use sets of covariates seen in any treatment group for each group:
                 covars = vars[,-1,drop=FALSE]
                 # if treatment is a factor, ii will be a factor with same levels:
                 ii = sort(unique(treatment))
                 newdata = cbind(treatment=ii[1], covars)
                 for(i in ii[-1]) newdata = rbind(newdata, cbind(treatment=i, covars))
                 # predict:
                 # (this should take care of factor covariates as well)
                 newMat = model.matrix(formula, as.data.frame(newdata))
                 eta <- as.matrix(newMat) %*% as.matrix(obj$coef)
                 mu_pred = as.vector(obj$family$linkinv(eta))
                 tapply(mu_pred, newdata$treatment, FUN=mean)
               }
  )
al2na commented 3 years ago

this could be useful yes, would you like to do a PR? also any ideas on how to return the calculated affect sizes without breaking the object structure and downstream functions?

Best, Altuna

On Sat, Mar 6, 2021 at 5:03 AM therealgenna notifications@github.com wrote:

Hi,

When methylation difference is calculated there are 3 options for effect size calculation: weighted mean, unweighted mean and model prediction ( predicted). All of these options calculate or estimate effect size (i.e., methylation difference) which is due both to the group (e.g., treatment vs. control) as well as due to covariates, when they are present/used, and if they are unbalanced (i.e., different) across the treatment groups.

I suggest an additional simple calculation for an average effect size prediction, corrected for covariates (named predicted2 below). Basically, all covariate sets found in any of the treatment groups will be used in all groups in making model predictions. The rest is the same. In that case the predicted effect will not be due to difference in covariates, will be based on covariate sets used in the data, and will agree with reported statistical significance for group effect. This is also what might be of interest to the user (as a note, it might make sense to return all effect size calculations together, not just one, so that there is no need to rerun the function to get a different version of effect size estimate.)

Here is the code I suggest to be added to the logReg function (starting with predicted2 below, as an additional possibility in the switch statement:

meths=switch(effect, wmean={

ms=tapply(y, as.factor(vars$treatment), sum )

             ms=tapply(y,treatment,sum)
             #ws=tapply(w, as.factor(vars$treatment), sum )
             ws=tapply(w,treatment,sum)
             ms/ws
           },
           mean={
             tapply(prop, treatment, FUN = mean)
           },
           predicted={
             tapply(mu, treatment, FUN = mean)
           },
           predicted2={
             # Add effect estimate calculated on balanced set of covariates.
             # Use sets of covariates seen in any treatment group for each group:
             covars = vars[,-1,drop=FALSE]
             # if treatment is a factor, ii will be a factor with same levels:
             ii = sort(unique(treatment))
             newdata = cbind(treatment=ii[1], covars)
             for(i in ii[-1]) newdata = rbind(newdata, cbind(treatment=i, covars))
             # predict:
             # (this should take care of factor covariates as well)
             newMat = model.matrix(formula, as.data.frame(newdata))
             eta <- as.matrix(newMat) %*% as.matrix(obj$coef)
             mu_pred = as.vector(obj$family$linkinv(eta))
             tapply(mu_pred, newdata$treatment, FUN=mean)
           }

)

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/al2na/methylKit/issues/224, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAE32EK2NINFEBSBP6WPMYTTCGSQZANCNFSM4YWJQDXA .

therealgenna commented 3 years ago

Ok, thanks, sounds good - I'll do a PR.

Regarding the multiple effect sizes - I'm not an expert in the code but I'd think that if you add more columns to the data.frame / methylDiff.obj it might be fine. So you'd keep the current 7 columns (including the 7th named meth.diff) but add one more column for each version of effect, named correspondingly (so one column will be identical to meth.diff.) Or you can add another option for effect='all', which might not work right away with some downstream stuff but you'd explain in the help page what it means and user can manually rename columns according to their needs ... Or maybe you could add the data.frame with all effect versions as an attribute to the methDiff object - internally you'd first calculate all versions and put them as extra columns, but before returning the object you could move these extra columns to an attribute. Again, these are just my thoughts on this.