Closed by myng-ahn 10 months ago
Thanks a lot for using GPBoost!
Dealing with non-Gaussian models is much more complex compared to Gaussian likelihoods (both for estimation and prediction). GPBoost currently uses a Laplace approximation, and there are several time-consuming steps, in particular finding the mode. How long calculations take is not mainly a function of the data size but of the number of levels / categories per random effect and of how the random effects are structured. If there is only one random effect, things should be fast (even when there are many levels). If there is more than one random effect, things can become slow, and the structure matters: nested random effects are much easier to handle than crossed ones.
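As a toy illustration of the difference (made-up grouping variables, not part of the example below): in a crossed design, the levels of the second grouping variable are shared across the levels of the first one, while in a nested design, each level of the second variable occurs within exactly one level of the first:
# Crossed: each level of group2 can co-occur with every level of group1
group1 <- c(1, 1, 2, 2, 3, 3)
group2_crossed <- c(1, 2, 2, 1, 1, 2)
# Nested: every level of group2 belongs to exactly one level of group1
group2_nested <- c(1, 2, 3, 4, 5, 6)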
Below is an example (sorry for using R, I realized too late that you use Python, but it's the same computations anyway...), in which predictions are done in approx. 2 minutes for n = 100'000 and two crossed random effects with m = 1'000 levels each (I do not use the boosting functionality, only the random effects model / GPModel, but that is the slow part). If you increase the number of levels m, the time increases. For instance, for m = 10'000, it goes up to 20 minutes on my laptop.
Having said that, there are options for speeding things up for non-Gaussian likelihoods with multiple random effects. Unfortunately, I have not had the time to look into them and develop something for GPBoost. A simple hack that always works is to sub-sample your training data (see the short sketch after the example below).
library(gpboost)

make_data <- function(n, m, randef, likelihood, has_F = FALSE){
  # Simulate grouped data with one or two random effects
  sigma2_1 <- 0.5^2 # random effect variance
  group <- rep(1,n) # grouping variable for first random effect
  for(i in 1:m) group[((i-1)*n/m+1):(i*n/m)] <- i
  b1 <- sqrt(sigma2_1) * rnorm(m) # random effects for first grouping variable
  if(randef == "One_random_effect"){
    eps <- b1[group]
    group_data <- group
  } else if(randef == "Two_randomly_crossed_random_effects"){
    group2 <- group[sample.int(n=n, size=n, replace=FALSE)] # random permutation of the first grouping variable
  } else if(randef == "Two_nested_random_effects"){
    m_nested <- m*2 # number of categories / levels for the second nested grouping variable
    group2 <- rep(1,n) # grouping variable for nested lower level random effects
    for(i in 1:m_nested) group2[((i-1)*n/m_nested+1):(i*n/m_nested)] <- i
  }
  if(randef != "One_random_effect"){
    b2 <- sqrt(sigma2_1) * rnorm(length(unique(group2))) # random effects for second grouping variable
    if (length(unique(group2)) != max(group2)) stop("not all levels sampled -> gives index problems")
    eps <- b1[group] + b2[group2]
    group_data <- cbind(group,group2)
  }
  # Simulate fixed effects
  if (has_F) {
    X <- matrix(runif(2*n),ncol=2)
    f1d <- function(x) 1/(1+exp(-(x-0.5)*10)) - 0.5
    f <- f1d(X[,1])
  } else {
    X <- matrix(rep(0,2*n),ncol=2)
    f <- 0
  }
  # Simulate response variable
  if (likelihood == "bernoulli_probit") {
    probs <- pnorm(f+eps)
    y <- as.numeric(runif(n) < probs)
  } else if (likelihood == "bernoulli_logit") {
    probs <- 1/(1+exp(-(f+eps)))
    y <- as.numeric(runif(n) < probs)
  } else if (likelihood == "poisson") {
    mu <- exp(f+eps)
    y <- qpois(runif(n), lambda = mu)
  } else if (likelihood == "gamma") {
    mu <- exp(f+eps)
    y <- qgamma(runif(n), scale = mu, shape = 1)
  } else if (likelihood == "gaussian") {
    mu <- f + eps
    sigma2 <- 0.5^2 # error variance
    y <- sqrt(sigma2) * rnorm(n) + mu
  }
  list(X=X, y=y, group_data = group_data)
}
likelihood <- "bernoulli_logit"
randef <- "Two_randomly_crossed_random_effects"
n <- 100000
m <- 1000
sim_data <- make_data(n=n, m=m, randef=randef, likelihood=likelihood, has_F=FALSE)
gp_model <- GPModel(group_data = sim_data$group_data, likelihood = likelihood)
start <- Sys.time()
y_pred <- predict(gp_model, group_data_pred=sim_data$group_data, y = sim_data$y, cov_pars = c(1,1))
end <- Sys.time()
print(end-start)
# Time difference of 2.703677 mins
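As an aside, here is a minimal sketch of the sub-sampling hack mentioned above. The subset size n_sub is an arbitrary illustrative choice, and the code assumes the crossed-random-effects setting from the example (group_data is a two-column matrix):
# Sub-sample the training data before estimation / prediction (illustrative sketch)
n_sub <- 20000 # hypothetical subset size
idx <- sample.int(n = n, size = n_sub, replace = FALSE)
gp_model_sub <- GPModel(group_data = sim_data$group_data[idx, ], likelihood = likelihood)
y_pred_sub <- predict(gp_model_sub, group_data_pred = sim_data$group_data,
                      y = sim_data$y[idx], cov_pars = c(1,1))
Predictions are still made for all n observations, but the expensive mode finding of the Laplace approximation only involves the n_sub sub-sampled training observations.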
Update: it seems that there is an easy fix to make prediction faster. What I said above about the dependence on the number and structure of the random effects still holds true. I will keep you updated.
I made some improvements, and prediction is faster now with version 1.2.6 (on PyPI). For instance, on the simulated data above, making predictions now takes only approx. 5 seconds instead of 2 minutes. Let me know how it works on your data.
Hi @fabsig, thank you for your work!
I am using a GPBoost mixed effects model with two random-effect variables (among 51 total variables) to predict a binary dependent variable, using the Bernoulli logit link function as the likelihood for GPModel.
I noticed that the prediction times using bst.predict increased drastically (7 min for n = 30,580 vs. 75 min for n = 124,114), especially compared to using the Gaussian likelihood (4 sec for n = 30,580 vs. 5 sec for n = 124,114). Do you know what may explain this discrepancy and how we can resolve it?
Code:
Environment: Windows 10, AMD Ryzen 5 4500U, Python 3.9, numpy 1.24.3, pandas 1.4.2