fabsig / GPBoost

Combining tree-boosting with Gaussian process and mixed effects models

bst.predict function has very long runtimes when likelihood of GPModel is set to bernoulli_logit #114

Closed: myng-ahn closed this issue 10 months ago

myng-ahn commented 1 year ago

Hi @fabsig, thank you for your work!

I am using a GPBoost mixed effects model with two random effect variables (among 51 variables in total) to predict a binary dependent variable, using the bernoulli_logit likelihood (logit link) for the GPModel.

I noticed that the prediction times using bst.predict increased drastically (7 min when n=30,580 vs 75 min when n=124,114), especially compared to using the Gaussian likelihood (4 sec when n = 30,580 vs 5 sec when n=124,114).

Do you know what might explain this discrepancy and how we can resolve it?

Code:

import time

import numpy as np
import gpboost as gpb

# X_train.shape : (124114, 51)
# X_test.shape  : (30580, 51)
# X_train, X_test, y_train, group_train, group_test, and categorical_fts are defined upstream

def inv_logit(log_odds):
    # inverse logit (sigmoid): maps log-odds to probabilities
    return 1 / (1 + np.exp(-log_odds))

##### PARAMS #####
# determined by parameter optimization step beforehand
params = {
    'eta': 0.015,
    'max_depth': 2,
    'colsample_bytree': 0.6,
    'min_child_weight': 1,
    'subsample': 0.75,
    'verbosity': 1,
    'lambda': 1,
    'alpha': 0.001,
    'num_leaves': 4
}
nrounds = 1200

##### TRAIN MODEL ######
train_data = gpb.Dataset(X_train, label=y_train)
train_data = train_data.set_categorical_feature(categorical_fts)
gaussian_model = gpb.GPModel(likelihood="gaussian", group_data=group_train)
bernli_logit_model = gpb.GPModel(likelihood="bernoulli_logit", group_data=group_train)

start = time.time()
gaussian_bst = gpb.train(params=params, train_set=train_data, gp_model=gaussian_model, num_boost_round=nrounds)
gaussian_train_time = time.time() - start

start = time.time()
bernli_logit_bst = gpb.train(params=params, train_set=train_data, gp_model=bernli_logit_model, num_boost_round=nrounds)
bernli_logit_train_time = time.time() - start

##### PREDICT ON TEST DATA, N=30,580 #####
start = time.time()
gaussian_pred = gaussian_bst.predict(data=X_test, group_data_pred=group_test)
gaussian_pred_time = time.time() - start
gaussian_y_pred = gaussian_pred["response_mean"]

start = time.time()
bernli_logit_pred = bernli_logit_bst.predict(data=X_test, group_data_pred=group_test, pred_latent=True, predict_var=True)
bernli_logit_pred_time = time.time() - start
# convert latent predictions (fixed effect + random effect mean) to response-scale probabilities
bernli_logit_y_pred = inv_logit(bernli_logit_pred["fixed_effect"] + bernli_logit_pred["random_effect_mean"])

##### PREDICT ON TRAIN DATA N=124,114 #####
start = time.time()
gaussian_train_pred = gaussian_bst.predict(data=X_train, group_data_pred=group_train)
gaussian_train_pred_time = time.time() - start
gaussian_train_y_pred = gaussian_train_pred["response_mean"]

start = time.time()
bernli_logit_train_pred = bernli_logit_bst.predict(data=X_train, group_data_pred=group_train, pred_latent=True, predict_var=True)
bernli_logit_train_pred_time = time.time() - start
bernli_logit_train_y_pred = inv_logit(bernli_logit_train_pred["fixed_effect"] + bernli_logit_train_pred["random_effect_mean"])

##### RUNTIMES #####
# GAUSSIAN:
#   Test:   3.5591 sec
#   Train:  5.6411 sec
# BERNOULLI LOGIT:
#   Test:   7.025 min
#   Train:  75.782 min

Environment: Windows 10, AMD Ryzen 5 4500U, Python 3.9, numpy 1.24.3, pandas 1.4.2

fabsig commented 1 year ago

Thanks a lot for using GPBoost!

Dealing with non-Gaussian likelihoods is much more complex than the Gaussian case (for both estimation and prediction). GPBoost currently uses a Laplace approximation, and there are several time-consuming steps, in particular finding the mode. The computational time is not mainly a function of the data size but of the number of levels / categories per random effect and of how the random effects are structured (nested is much easier than crossed). If there is only one random effect, things should be fast (even when there are many levels); with more than one random effect, things can become slow.
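To make the crossed vs. nested distinction concrete, here is a minimal Python sketch (the sizes and the two-subgroups-per-level nesting are made up for illustration) of the two grouping structures as they would be passed via group_data, one column per random effect (cf. cbind(group, group2) in the R example below):

import numpy as np

rng = np.random.default_rng(0)
n, m = 100000, 1000

# crossed: the two grouping variables are assigned independently of each other,
# so most combinations of their levels occur somewhere in the data
group1 = rng.integers(0, m, size=n)
group2_crossed = rng.integers(0, m, size=n)
group_data_crossed = np.column_stack((group1, group2_crossed))

# nested: each level of the second grouping variable occurs within exactly one
# level of the first (here: two sub-groups per group1 level)
group2_nested = 2 * group1 + rng.integers(0, 2, size=n)
group_data_nested = np.column_stack((group1, group2_nested))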

Below is an example (sorry for using R, I realized too late that you use Python, but it is the same computation anyway...), in which predictions take approx. 2 minutes for n = 100'000 and two crossed random effects with m = 1'000 levels each (I do not use the boosting functionality, only the random effects model / GPModel, but that is the slow part). If you increase the number of levels m, the time increases; for instance, for m = 10'000 it goes up to 20 minutes on my laptop.

Having said that, there are ways to speed things up for non-Gaussian likelihoods with multiple random effects. Unfortunately, I have not had the time to look into them and develop something for GPBoost. A simple hack that always works is to sub-sample your training data (a minimal Python sketch of this is shown after the R example below).


library(gpboost)

make_data <- function(n, m, randef, likelihood, has_F = FALSE){
  sigma2_1 <- 0.5^2 # random effect variance
  group <- rep(1,n) 
  for(i in 1:m) group[((i-1)*n/m+1):(i*n/m)] <- i
  b1 <- sqrt(sigma2_1) * rnorm(m)
  if(randef == "One_random_effect"){
    eps <- b1[group]
    group_data <- group
  } else if(randef == "Two_randomly_crossed_random_effects"){
    group2 <-  group[sample.int(n=n, size=n, replace=FALSE)]
  } else if(randef == "Two_nested_random_effects"){
    m_nested <- m*2 # number of categories / levels for the second nested grouping variable
    group2 <- rep(1,n)  # grouping variable for nested lower level random effects
    for(i in 1:m_nested) group2[((i-1)*n/m_nested+1):(i*n/m_nested)] <- i
  }
  if(randef != "One_random_effect"){
    b2 <- sqrt(sigma2_1) * rnorm(length(unique(group2)))
    if (length(unique(group2)) != max(group2)) stop("not all levels sampled -> gives index problems")
    eps <- b1[group] + b2[group2]
    group_data <- cbind(group,group2)
  }
  #   Simulate fixed effects
  if (has_F) {
    X <- matrix(runif(2*n),ncol=2)
    f1d <- function(x) 1/(1+exp(-(x-0.5)*10)) - 0.5
    f   <- f1d(X[,1])
  }else{
    X <- matrix(rep(0,2*n),ncol=2)
    f <- 0
  }
  #   Simulate response variable
  if (likelihood == "bernoulli_probit") {
    probs <- pnorm(f+eps)
    y <- as.numeric(runif(n) < probs)
  } else if (likelihood == "bernoulli_logit") {
    probs <- 1/(1+exp(-(f+eps)))
    y <- as.numeric(runif(n) < probs)
  } else if (likelihood == "poisson") {
    mu <- exp(f+eps)
    y <- qpois(runif(n), lambda = mu)
  } else if (likelihood == "gamma") {
    mu <- exp(f+eps)
    y <- qgamma(runif(n), scale = mu, shape = 1)
  } else if (likelihood == "gaussian") {
    mu <- f + eps
    sigma2 <- 0.5^2 # error variance
    y <- sqrt(sigma2) * rnorm(n) + mu
  }
  list(X=X, y=y, group_data = group_data)
}

likelihood <- "bernoulli_logit"
randef <- "Two_randomly_crossed_random_effects"
n <- 100000
m <- 1000
sim_data <- make_data(n=n, m=m, randef=randef, likelihood=likelihood, has_F=FALSE)
gp_model <- GPModel(group_data = sim_data$group_data, likelihood = likelihood)
start <- Sys.time()
y_pred <- predict(gp_model, group_data_pred=sim_data$group_data, y = sim_data$y, cov_pars = c(1,1))
end <- Sys.time()
print(end-start)

# Time difference of 2.703677 mins
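
For reference, a rough Python sketch of the sub-sampling workaround mentioned above, reusing the variable names from the issue (the 30% fraction is arbitrary; if X_train / group_train are pandas objects, use .iloc[idx] instead of [idx]):

import numpy as np

# draw a random 30% subset of the training rows
rng = np.random.default_rng(42)
idx = rng.choice(len(y_train), size=int(0.3 * len(y_train)), replace=False)
X_sub, y_sub, group_sub = X_train[idx], y_train[idx], group_train[idx]

# fit boosting + random effects model on the subset only
train_data_sub = gpb.Dataset(X_sub, label=y_sub)
gp_model_sub = gpb.GPModel(likelihood="bernoulli_logit", group_data=group_sub)
bst_sub = gpb.train(params=params, train_set=train_data_sub,
                    gp_model=gp_model_sub, num_boost_round=nrounds)
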
fabsig commented 1 year ago

Update: it seems there is an easy fix to make prediction faster. What I said above still holds regarding the dependence on the number and structure of the random effects. I will keep you updated.

fabsig commented 1 year ago

I made some improvements and prediction is faster now with version 1.2.6 (on PyPI). For instance, on the above simulated data, making predictions now takes only approx. 5 secs instead of 2 mins. Let me know how it works on your data.
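
A minimal sketch for checking this on your side, assuming a pip-based environment and reusing the objects from your snippet above (upgrade first, then re-run training and re-time the predict call):

# pip install --upgrade "gpboost>=1.2.6"

import time

start = time.time()
bernli_logit_pred = bernli_logit_bst.predict(data=X_test, group_data_pred=group_test,
                                             pred_latent=True, predict_var=True)
print("bernoulli_logit prediction time: %.1f sec" % (time.time() - start))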