fabsig / GPBoost

Combining tree-boosting with Gaussian process and mixed effects models

Inconsistency between Log Odds and Probability (mu) predictions from GLMM logit models #104

Closed: flare-hawk closed this issue 11 months ago

flare-hawk commented 11 months ago

I was using GPBoost in Python for a GLMM logit regression.

Versions: Python 3.7.16, R 4.2.1, GPBoost 1.2.3, glmmTMB 1.1.7

The coefficients from GPModel are very similar to those estimated by glmmTMB, so, not surprisingly, the linear predictions from GPModel are also very similar to those from glmmTMB. However, when I predict the response with GPModel (which I assumed is a probability), it gives very strange results.

For instance, the predictions from models estimated on the same data, for the same observation:

GPModel: xb (linear prediction): -9.428315, response: 0.02483198
glmmTMB: xb: -9.563228, response: 0.0000702607

Sorry, I don't have sample data at the moment, but I can dig around and find some if necessary.
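For reference, a minimal sketch of how such a side-by-side comparison might be set up in R (not the actual code from this issue; a data frame dat with response y, covariate x1, and grouping variable grp is assumed purely for illustration):

library(gpboost)
library(glmmTMB)

# GPBoost: random-intercept logit model
X <- cbind(Intercept = 1, x1 = dat$x1)
gp_model <- fitGPModel(group_data = dat$grp, y = dat$y, X = X,
                       likelihood = "bernoulli_logit")
pred_lat <- predict(gp_model, group_data_pred = dat$grp, X_pred = X,
                    predict_response = FALSE)  # latent scale ("xb")
pred_resp <- predict(gp_model, group_data_pred = dat$grp, X_pred = X,
                     predict_response = TRUE)  # response scale (probability)

# glmmTMB: same model specification
tmb_fit <- glmmTMB(y ~ x1 + (1 | grp), data = dat, family = binomial)
xb_tmb <- predict(tmb_fit, type = "link")        # latent scale
prob_tmb <- predict(tmb_fit, type = "response")  # response scale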

fabsig commented 11 months ago

Thank you for your feedback.

The response prediction (a probability here) depends on both the latent mean and the latent variance: it is a one-dimensional integral over the latent variable. What is the predicted latent variance in your case for GPBoost and glmmTMB? If you tell me this, I can double-check (e.g., on my pocket calculator) whether the response prediction is correct. In GPBoost, you get the latent variance as follows:
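In other words (restating the above as a formula, in the same notation as the code further down): with latent mean mu_lat and latent variance var_lat, the predicted probability is

p = integral of 1 / (1 + exp(-z)) * N(z; mu_lat, var_lat) dz,

i.e., the logistic function averaged over the Gaussian distribution of the latent variable, not the logistic function evaluated at mu_lat.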

pred = predict(gp_model, group_data_pred = group_data, X_pred = x1, 
                     predict_response = FALSE, predict_var = TRUE)
pred$var

FWIW: you mention that you use Python, but your code is R code.

flare-hawk commented 11 months ago

Oh, I am sorry for the confusion. I was using Python, but I have no other point of comparison on the Python side at the moment, so I went back to R to do the comparison. Both versions of GPBoost give the same results for the prediction. This is the prediction from the Python code on the same data:

[screenshot: Python prediction output]

BTW, I think that, according to the GPBoost documentation, the linear prediction (predict_response=False) is also based on the fixed plus random effects.

Here are the predicted latent variances from GPBoost and glmmTMB in R:

[screenshot: latent variance predictions from GPBoost and glmmTMB]

In addition, here is a summary of the mean predicted probability from glmmTMB (pred3) and GPBoost (pred1), compared to the mean of Y. The one from GPBoost is an order of magnitude different:

[screenshot: summary of mean predicted probabilities vs. the mean of Y]

I also did a sanity check for myself: I used the predicted latent mean (pred4) from glmmTMB as the log odds and converted it back to a probability (exp(log odds) / (1 + exp(log odds))). It was extremely close to the predicted probability (pred3).
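A minimal sketch of that conversion (pred3 and pred4 are the objects described above; the names are assumed from the description):

prob_from_logodds <- exp(pred4) / (1 + exp(pred4))  # equivalently: plogis(pred4)
summary(prob_from_logodds - pred3)                  # differences are close to 0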

[screenshot: converted pred4 values compared to pred3]

fabsig commented 11 months ago

The prediction you obtain with glmmTMB is simply the latent mean prediction passed through the (inverse) link function (1 / (1 + exp(-mu_lat))). This is wrong for latent Gaussian models such as random effects models, as it ignores the variability in the random effects (posterior). As I said, you need to calculate an integral to get the correct prediction. This is basic knowledge concerning random effects models (see, e.g., Equation (19) in Sigrist (2023, TPAMI)).

In the code below, I compare the two versions, and with numeric integration I get the same result as you obtain with GPBoost. You might want to ask the people at glmmTMB. I have never used this package and can therefore not provide you with an explanation for this, but I am sure they can help you and explain it.

mu_lat <- -9.42   # predicted latent mean (approximately, as reported above)
var_lat <- 19.75  # predicted latent variance
f <- function(x) 1 / (1 + exp(-x)) * dnorm(x, mean = mu_lat, sd = sqrt(var_lat))
integrate(f, -100, 100) # correct prediction: integrate over the latent distribution
# 0.02492965 with absolute error < 7.2e-05
1 / (1 + exp(-mu_lat)) # wrong prediction: plug only the latent mean into the link
# [1] 8.107944e-05
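As a further cross-check (not in the original comment; it reuses mu_lat and var_lat from above), the same integral can be approximated by Monte Carlo sampling from the latent Gaussian:

set.seed(1)
z <- rnorm(1e6, mean = mu_lat, sd = sqrt(var_lat))  # draws from the latent distribution
mean(1 / (1 + exp(-z)))  # should be close to the integrate() result (~0.0249)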

It is unclear to me why, in your example, the mean of the wrong predictions is closer to the mean of your response variable data. There might be many potential reasons (e.g., wrong model specification, but this is pure speculation...). In any case, simply being closer to what you expect to see does not, in my opinion, justify using a wrong method (it might be pure luck). Below is a simulated example where the mean of the response data is closer to the correctly predicted response than to the wrong prediction.

library(gpboost)
# Simulate data
n <- 10000 # number of samples
m <- 1000 # number of categories / levels for grouping variable
group <- rep(1,n) # grouping variable
for(i in 1:m) group[((i-1)*n/m+1):(i*n/m)] <- i
set.seed(4)
b <- rnorm(m)
rand_eff <- b[group]
# Simulate linear regression fixed effects
X <- cbind(rep(1,n),runif(n)-0.5) # design matrix / covariate data for fixed effects
beta <- c(-4,2) # regression coefficients
lp <- X %*% beta
probs <- 1 / (1 + exp(-(lp + rand_eff)))
y <- as.numeric(runif(n) < probs)
mean(y)
# [1] 0.0349
# Estimate model
gp_model <- fitGPModel(group_data = group, y = y, X = X, likelihood = "bernoulli_logit")
# Correct response prediction
pred_resp <- predict(gp_model, X_pred = X, group_data_pred = group,
                     predict_response = TRUE)
mean(pred_resp$mu)
# [1] 0.0387637
# Wrong response prediction
pred_lat <- predict(gp_model, X_pred = X, group_data_pred = group,
                    predict_response = FALSE)
pred_resp_wrong <- 1 / (1 + exp(-pred_lat$mu))
mean(pred_resp_wrong)
# [1] 0.0273712
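As a small follow-up (not part of the original example; it reuses the objects created above), the distances of the two prediction means to the observed mean of y can be compared directly:

abs(mean(y) - mean(pred_resp$mu))     # correct prediction: ~0.004 away from mean(y)
abs(mean(y) - mean(pred_resp_wrong))  # wrong prediction:   ~0.008 away from mean(y)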