Please add context to descibe the probabilistic imputation.

qingjian1991 commented 2 years ago

Hi @aet21 :

In the third step: "Development and application of probabilistic imputation model" of your GB paper, the DNAm levels of unexpressed genes are estimated by a 2-state mixture of gamma distributions of all expressed genes. Users may want to learn how to build this model from the expression matrices of SCM2 and RMAP. I found that the DNAm levels, the gene expression levels, and the final posterior probabilities are stored in the data("dbSCM2") and data("dbRMAP"). The final posterior probabilities are inferred from the gene expression matrix based on the 2-state mixture of gamma distributions. Could you add some example codes to describe how to infer the posterior probabilities? Your open code will provide valuable learning resources for beginners in bioinformatics.

Below is some code, I have tried.

topIntMR.m <- topIntMRrmap.m
intDNAm.m <- betaRMAPint.m
intExpR.m <- expRMAPint.m
pEgX.m <- pEgXrmap.m

#Get the gene expressions.
exp =c(intExpR.m); exp = exp[!is.na(c(exp))]

#The 2 gramma distributions. We assume one is an exponential distribution.
#One is exponential distribution. a = 1, b = 1/0.235734
mean(exp[exp <= 1.1]) #0.235734

#define function to calculate mode
find_mode <- function(x) { u <- unique(x); tab <- tabulate(match(x, u));u[tab == max(tab)]}

#The other one is gammam distribution.
x = mean(exp[exp > 1.1]) #3.66
y = find_mode(exp[exp > 1.1]) #1.186501 

#get the alpha and beta values.
a = x/(x-y) #1.479176 1.429793
b = 1/(x-y) #0.4038569 0.3903738

#Fit the model, using the estimated prior parameters.
out = gammamixEM(exp,  alpha = c(1, 1.48), beta = c(4.24, 0.40),  epsilon = 1e-06)

out$lambda
#[1] 0.94017109 0.05982891

out$gamma.pars
#comp.1         comp.2
#alpha 1.000000 1.48000000
#beta  2.670065 0.01694109

a = c(1, 1.48); b = c(2.67, 0.0169)
x = exp

#The posterior probabilities
P_exp = dgamma(x, shape = a[2],  rate = b[2])*0.06/( dgamma(x, shape = a[2],  rate = b[2] )*0.06  + dgamma(x, shape = a[1],  rate = b[1])*0.94)

plot(density(P_exp))

aet21 commented 2 years ago

For a given imputable marker gene, the procedure to determine which samples are expressed and which ones are not, is very clearly explained in the Genome Biology paper. This is basic 2-state mixture modelling stuff. For the two databases SCM2 and RMAP, these analyses are precomputed for each imputable gene, as there is no point recalculating these every single time you want to use EpiSCORE, i.e these analyses are completely separate from the actual imputation of the DNAm reference matrix.

qingjian1991 commented 2 years ago

For a given imputable marker gene, the procedure to determine which samples are expressed and which ones are not, is very clearly explained in the Genome Biology paper. This is basic 2-state mixture modelling stuff. For the two databases SCM2 and RMAP, these analyses are precomputed for each imputable gene, as there is no point recalculating these every single time you want to use EpiSCORE, i.e these analyses are completely separate from the actual imputation of the DNAm reference matrix.

Thanks, I just want to learn how to do the bayesian inference according to the two-components distributions. I have know how to do that according to this site: https://www.cs.toronto.edu/~rgrosse/csc321/mixture_models.pdf. By the way, my code provides the right soultion in R.

aet21 / EpiSCORE

Please add context to descibe the probabilistic imputation. #5