fabsig / GPBoost

Combining tree-boosting with Gaussian process and mixed effects models

Error in initialize(...) : std::bad_alloc #121

Closed: juandavidgutier closed this issue 6 months ago

juandavidgutier commented 6 months ago

Hello, I am new to GPBoost, but I am trying to use it for epidemiological research with grouped and areal spatial data.

My binary response variable is "excess"; the observations are grouped by municipality, identified by the variable "Code.DANE". "Lat" and "Long" are the coordinates of the municipalities, and the covariates are: 'Temp', 'SOI', 'SST3', 'SST4', 'SST34', 'SST12', 'NATL', 'SATL', 'TROP', 'forest', 'Year', and 'Month'. When I try to create a model with the function GPModel, I get the error message: "Error in initialize(...) : std::bad_alloc"

Here is the dataset that I am using: data_question.csv

And this is my code:

```r
library(gpboost)
library(ggplot2)
library(gridExtra)
library(viridis)
library(sf)
library(dplyr)
library(tidyr)

data <- read.csv("D:/data_question.csv")

data_AirTemp <- select(data, 'excess', 'Code.DANE', 'Temp', 'SOI', 'SST3', 'SST4',
                       'SST34', 'SST12', 'NATL', 'SATL', 'TROP', 'forest',
                       'Year', 'Month', 'Long', 'Lat')

# Dichotomize the climate covariates at their medians
median_SOI <- median(data_AirTemp$SOI, na.rm = TRUE)
data_AirTemp$SOI <- ifelse(data_AirTemp$SOI >= median_SOI, 1, 0)
median_SST3 <- median(data_AirTemp$SST3, na.rm = TRUE)
data_AirTemp$SST3 <- ifelse(data_AirTemp$SST3 >= median_SST3, 1, 0)
median_SST4 <- median(data_AirTemp$SST4, na.rm = TRUE)
data_AirTemp$SST4 <- ifelse(data_AirTemp$SST4 >= median_SST4, 1, 0)
median_SST34 <- median(data_AirTemp$SST34, na.rm = TRUE)
data_AirTemp$SST34 <- ifelse(data_AirTemp$SST34 >= median_SST34, 1, 0)
median_SST12 <- median(data_AirTemp$SST12, na.rm = TRUE)
data_AirTemp$SST12 <- ifelse(data_AirTemp$SST12 >= median_SST12, 1, 0)
median_NATL <- median(data_AirTemp$NATL, na.rm = TRUE)
data_AirTemp$NATL <- ifelse(data_AirTemp$NATL >= median_NATL, 1, 0)
median_SATL <- median(data_AirTemp$SATL, na.rm = TRUE)
data_AirTemp$SATL <- ifelse(data_AirTemp$SATL >= median_SATL, 1, 0)
median_TROP <- median(data_AirTemp$TROP, na.rm = TRUE)
data_AirTemp$TROP <- ifelse(data_AirTemp$TROP >= median_TROP, 1, 0)
median_forest <- median(data_AirTemp$forest, na.rm = TRUE)
data_AirTemp$forest <- ifelse(data_AirTemp$forest >= median_forest, 1, 0)

# Drop rows with missing values and convert to a matrix
data_AirTemp_na <- data_AirTemp %>% drop_na()
data_AirTemp_na <- as.matrix(data_AirTemp_na[, names(data_AirTemp_na)])
covars <- c('Temp', 'SOI', 'SST3', 'SST4', 'SST34', 'SST12', 'NATL', 'SATL',
            'TROP', 'forest', 'Year', 'Month')

# Choosing tuning parameters (HERE IS THE ERROR)
gp_model <- gpboost::GPModel(group_data = data_AirTemp_na[, "Code.DANE"],
                             gp_coords = data_AirTemp_na[, c("Long", "Lat")],
                             likelihood = "bernoulli_logit", cov_function = "exponential")

boost_data <- gpboost::gpb.Dataset(data = data_AirTemp_na[, covars],
                                   label = data_AirTemp_na[, "excess"])
param_grid <- list("learning_rate" = c(1, 0.1, 0.01),
                   "min_data_in_leaf" = c(10, 100, 1000),
                   "max_depth" = c(1, 2, 3, 5, 10),
                   "lambda_l2" = c(0, 1, 10))
other_params <- list(num_leaves = 2^10)
set.seed(1)
opt_params <- gpboost::gpb.grid.search.tune.parameters(param_grid = param_grid,
                                                       params = other_params,
                                                       num_try_random = 25, nfold = 4,
                                                       data = boost_data, gp_model = gp_model,
                                                       nrounds = 50, early_stopping_rounds = 10,
                                                       verbose_eval = 1,
                                                       metric = "auc")  # by the way, is this metric OK?
opt_params
```

fabsig commented 6 months ago

Thanks a lot for using GPBoost!

Your data has approx. 180'000 samples. The reason for the error is that a very large (177096 x 177096) covariance matrix is created, which is far too large for your memory (RAM). With this data, the problem will disappear if you have only grouped OR only spatial GP random effects. For instance, you can move the grouped random effects into the fixed-effects part as follows and model the grouped effects with the categorical_feature option (or also without it...):

covars <- c('Code.DANE','Temp', 'SOI', 'SST3', 'SST4', 'SST34', 'SST12', 'NATL', 'SATL', 'TROP', 'forest', 'Year', 'Month')
gp_model <- gpboost::GPModel(gp_coords = data_AirTemp_na[, c("Long", "Lat")],
                             likelihood = "bernoulli_logit", cov_function = "exponential") 
boost_data <- gpboost::gpb.Dataset(data = data_AirTemp_na[, covars], label = data_AirTemp_na[, "excess"],
                                   categorical_feature = 1)
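A quick base-R sanity check of the index passed to categorical_feature (nothing GPBoost-specific, just confirming that index 1 refers to 'Code.DANE' once it is moved to the front of covars):

```r
# Reordered covariate list from the fix above: 'Code.DANE' is now column 1
covars <- c('Code.DANE', 'Temp', 'SOI', 'SST3', 'SST4', 'SST34', 'SST12',
            'NATL', 'SATL', 'TROP', 'forest', 'Year', 'Month')

covars[1]                       # "Code.DANE": the column marked categorical
idx <- which(covars == 'Code.DANE')
idx                             # 1, matching categorical_feature = 1
```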

This works since you have only approx. 1000 unique spatial locations. If the number of unique locations were much larger, you would need a GP approximation (see #111).
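A back-of-the-envelope calculation (base R; the sample count 177096 and the roughly 1000 unique locations are taken from this thread) illustrates why the dense covariance matrix cannot be allocated:

```r
# Why std::bad_alloc occurs: with both grouped and spatial random effects,
# a dense covariance matrix over all samples is created.
n <- 177096                     # samples after dropping NAs (from the error above)
gib_dense <- n^2 * 8 / 2^30     # n x n matrix of 8-byte doubles, in GiB
round(gib_dense)                # ~234 GiB, far beyond typical RAM

# With only the spatial GP, the covariance is built over unique locations
n_loc <- 1000                   # approx. number of unique municipalities
mib_loc <- n_loc^2 * 8 / 2^20   # in MiB
round(mib_loc)                  # ~8 MiB, easily fits in memory
```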

Note that, since the number of groups / categories is also relatively small, it would technically be possible to run such a model with both grouped AND spatial GP random effects for your data. However, this is not implemented in GPBoost and would require quite some software engineering, and my time is scarce...

juandavidgutier commented 6 months ago

Hello Fabio,

Thanks a lot for your cooperation and clear explanation.

fabsig commented 6 months ago

You're welcome!