fabsig / GPBoost

Combining tree-boosting with Gaussian process and mixed effects models

Error in initialize(...) : std::bad_alloc #121

Closed: juandavidgutier closed this issue 6 months ago

juandavidgutier commented 6 months ago

Hello, I am new to GPBoost, but I am trying to use it for epidemiological research with grouped and areal spatial data.

My binary response variable is "excess"; the observations are grouped by municipality, identified by the variable "Code.DANE". "Lat" and "Long" are the coordinates of the municipalities, and the covariates are: 'Temp', 'SOI', 'SST3', 'SST4', 'SST34', 'SST12', 'NATL', 'SATL', 'TROP', 'forest', 'Year', and 'Month'. When I try to create a model with the function GPModel, I get the error message: "Error in initialize(...) : std::bad_alloc"

Here is the dataset that I am using: data_question.csv

And this is my code:

```r
library(gpboost)
library(ggplot2)
library(gridExtra)
library(viridis)
library(sf)
library(dplyr)
library(tidyr)

data <- read.csv("D:/data_question.csv")

data_AirTemp <- select(data, 'excess', 'Code.DANE', 'Temp', 'SOI', 'SST3', 'SST4',
                       'SST34', 'SST12', 'NATL', 'SATL', 'TROP', 'forest',
                       'Year', 'Month', 'Long', 'Lat')

# Dichotomize the climate covariates at their medians
median_SOI <- median(data_AirTemp$SOI, na.rm = TRUE)
data_AirTemp$SOI <- ifelse(data_AirTemp$SOI >= median_SOI, 1, 0)
median_SST3 <- median(data_AirTemp$SST3, na.rm = TRUE)
data_AirTemp$SST3 <- ifelse(data_AirTemp$SST3 >= median_SST3, 1, 0)
median_SST4 <- median(data_AirTemp$SST4, na.rm = TRUE)
data_AirTemp$SST4 <- ifelse(data_AirTemp$SST4 >= median_SST4, 1, 0)
median_SST34 <- median(data_AirTemp$SST34, na.rm = TRUE)
data_AirTemp$SST34 <- ifelse(data_AirTemp$SST34 >= median_SST34, 1, 0)
median_SST12 <- median(data_AirTemp$SST12, na.rm = TRUE)
data_AirTemp$SST12 <- ifelse(data_AirTemp$SST12 >= median_SST12, 1, 0)
median_NATL <- median(data_AirTemp$NATL, na.rm = TRUE)
data_AirTemp$NATL <- ifelse(data_AirTemp$NATL >= median_NATL, 1, 0)
median_SATL <- median(data_AirTemp$SATL, na.rm = TRUE)
data_AirTemp$SATL <- ifelse(data_AirTemp$SATL >= median_SATL, 1, 0)
median_TROP <- median(data_AirTemp$TROP, na.rm = TRUE)
data_AirTemp$TROP <- ifelse(data_AirTemp$TROP >= median_TROP, 1, 0)
median_forest <- median(data_AirTemp$forest, na.rm = TRUE)
data_AirTemp$forest <- ifelse(data_AirTemp$forest >= median_forest, 1, 0)

# Drop rows with missing values and convert to a matrix
data_AirTemp_na <- data_AirTemp %>% drop_na()
data_AirTemp_na <- as.matrix(data_AirTemp_na[, names(data_AirTemp_na)])
covars <- c('Temp', 'SOI', 'SST3', 'SST4', 'SST34', 'SST12', 'NATL', 'SATL',
            'TROP', 'forest', 'Year', 'Month')

# Choosing tuning parameters (HERE IS THE ERROR)
gp_model <- gpboost::GPModel(group_data = data_AirTemp_na[, "Code.DANE"],
                             gp_coords = data_AirTemp_na[, c("Long", "Lat")],
                             likelihood = "bernoulli_logit", cov_function = "exponential")

boost_data <- gpboost::gpb.Dataset(data = data_AirTemp_na[, covars],
                                   label = data_AirTemp_na[, "excess"])
param_grid <- list("learning_rate" = c(1, 0.1, 0.01),
                   "min_data_in_leaf" = c(10, 100, 1000),
                   "max_depth" = c(1, 2, 3, 5, 10),
                   "lambda_l2" = c(0, 1, 10))
other_params <- list(num_leaves = 2^10)
set.seed(1)
opt_params <- gpboost::gpb.grid.search.tune.parameters(param_grid = param_grid,
                                                       params = other_params,
                                                       num_try_random = 25, nfold = 4,
                                                       data = boost_data, gp_model = gp_model,
                                                       nrounds = 50, early_stopping_rounds = 10,
                                                       verbose_eval = 1,
                                                       metric = "auc")  # by the way, is this metric OK?
opt_params
```

fabsig commented 6 months ago

Thanks a lot for using GPBoost!

Your data has approx. 180'000 samples. The reason for the error is that a very large (177096 x 177096) covariance matrix is created, which is far too large for your memory (RAM). With this data, the problem will disappear if you have only grouped OR only spatial GP random effects. For instance, you can move the grouped random effects into the fixed-effects part as follows and model the grouped effects with the categorical_feature option (or also without it...):

covars <- c('Code.DANE','Temp', 'SOI', 'SST3', 'SST4', 'SST34', 'SST12', 'NATL', 'SATL', 'TROP', 'forest', 'Year', 'Month')
gp_model <- gpboost::GPModel(gp_coords = data_AirTemp_na[, c("Long", "Lat")],
                             likelihood = "bernoulli_logit", cov_function = "exponential") 
boost_data <- gpboost::gpb.Dataset(data = data_AirTemp_na[, covars], label = data_AirTemp_na[, "excess"],
                                   categorical_feature = 1)
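A quick base-R sanity check of the index passed to categorical_feature (nothing GPBoost-specific, just confirming that index 1 refers to 'Code.DANE' once it is moved to the front of covars):

```r
# Reordered covariate list from the fix above: 'Code.DANE' is now column 1
covars <- c('Code.DANE', 'Temp', 'SOI', 'SST3', 'SST4', 'SST34', 'SST12',
            'NATL', 'SATL', 'TROP', 'forest', 'Year', 'Month')

covars[1]                       # "Code.DANE": the column marked categorical
idx <- which(covars == 'Code.DANE')
idx                             # 1, matching categorical_feature = 1
```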

This works since you have only approx. 1000 unique spatial locations. If the number of unique locations were much larger, you would need a GP approximation (see #111).
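A back-of-the-envelope calculation (base R; the sample count 177096 and the roughly 1000 unique locations are taken from this thread) illustrates why the dense covariance matrix cannot be allocated:

```r
# Why std::bad_alloc occurs: with both grouped and spatial random effects,
# a dense covariance matrix over all samples is created.
n <- 177096                     # samples after dropping NAs (from the error above)
gib_dense <- n^2 * 8 / 2^30     # n x n matrix of 8-byte doubles, in GiB
round(gib_dense)                # ~234 GiB, far beyond typical RAM

# With only the spatial GP, the covariance is built over unique locations
n_loc <- 1000                   # approx. number of unique municipalities
mib_loc <- n_loc^2 * 8 / 2^20   # in MiB
round(mib_loc)                  # ~8 MiB, easily fits in memory
```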

Note that, since the number of groups / categories is also relatively small, it would technically be possible to run such a model with both grouped AND spatial GP random effects for your data. However, this is not implemented in GPBoost and would require quite some software engineering, and my time is scarce...

juandavidgutier commented 6 months ago

Hello Fabio,

Thanks a lot for your cooperation and clear explanation.

fabsig commented 6 months ago

You're welcome!