fabsig / GPBoost

Combining tree-boosting with Gaussian process and mixed effects models

Using gpboost for spatial data (real estate) #111

Closed imadoualid closed 1 year ago

imadoualid commented 1 year ago

Hello, I'm using GPBoost to train a house price model and I'm getting this error:

GPBoostError                              Traceback (most recent call last)
Cell In[40], line 1
----> 1 gp_model = gpb.GPModel(gp_coords=X_train_coord )
      2 data_train = gpb.Dataset(X_train[['scaled_surface', 'is_annex', 'is_premise', 'part_count']],y_train)
      3 params = { 'objective': 'regression_l2', 'learning_rate': 0.01,
      4             'max_depth': 3, 'min_data_in_leaf': 10, 
      5             'num_leaves': 2**10, 'verbose': 1 , 
      6          }

File /projects/effidata/repository/.venv/lib/python3.10/site-packages/gpboost/basic.py:4487, in GPModel.__init__(self, likelihood, group_data, group_rand_coef_data, ind_effect_group_rand_coef, drop_intercept_group_rand_effect, gp_coords, gp_rand_coef_data, cov_function, cov_fct_shape, gp_approx, cov_fct_taper_range, cov_fct_taper_shape, num_neighbors, vecchia_ordering, num_ind_points, matrix_inversion_method, seed, cluster_ids, free_raw_data, model_file, model_dict, vecchia_approx, vecchia_pred_type, num_neighbors_pred)
   4483     cluster_ids_c = cluster_ids.ctypes.data_as(ctypes.POINTER(ctypes.c_int32))
   4485 self.__determine_num_cov_pars(likelihood=likelihood)
-> 4487 _safe_call(_LIB.GPB_CreateREModel(
   4488     ctypes.c_int(self.num_data),
   4489     cluster_ids_c,
   4490     group_data_c,
   4491     ctypes.c_int(self.num_group_re),
   4492     group_rand_coef_data_c,
   4493     ind_effect_group_rand_coef_c,
   4494     ctypes.c_int(self.num_group_rand_coef),
   4495     drop_intercept_group_rand_effect_c,
   4496     ctypes.c_int(self.num_gp),
   4497     gp_coords_c,
   4498     ctypes.c_int(self.dim_coords),
   4499     gp_rand_coef_data_c,
   4500     ctypes.c_int(self.num_gp_rand_coef),
   4501     c_str(self.cov_function),
   4502     ctypes.c_double(self.cov_fct_shape),
   4503     c_str(self.gp_approx),
   4504     ctypes.c_double(self.cov_fct_taper_range),
   4505     ctypes.c_double(self.cov_fct_taper_shape),
   4506     ctypes.c_int(self.num_neighbors),
   4507     c_str(self.vecchia_ordering),
   4508     ctypes.c_int(self.num_ind_points),
   4509     c_str(likelihood),
   4510     c_str(self.matrix_inversion_method),
   4511     ctypes.c_int(self.seed),
   4512     ctypes.byref(self.handle)))
   4514 # Should we free raw data?
   4515 self.free_raw_data = free_raw_data

File /projects/effidata/repository/.venv/lib/python3.10/site-packages/gpboost/basic.py:145, in _safe_call(ret)
    137 """Check the return value from C API call.
    138 
    139 Parameters
   (...)
    142     The return value from C API calls.
    143 """
    144 if ret != 0:
--> 145     raise GPBoostError(_LIB.LGBM_GetLastError().decode('utf-8'))

GPBoostError: std::bad_alloc

The code I'm running is:

X_train["x"]=X_train.geometry.map(lambda g: g.x)
X_train["y"]=X_train.geometry.map(lambda g: g.y)
X_train_coord = X_train[["x","y"]]

gp_model = gpb.GPModel(gp_coords=X_train_coord )
data_train = gpb.Dataset(X_train[['scaled_surface', 'is_annex', 'is_premise', 'part_count']],y_train)
params = { 'objective': 'regression_l2', 'learning_rate': 0.01,
            'max_depth': 3, 'min_data_in_leaf': 10, 
            'num_leaves': 2**10, 'verbose': 1 , 
         }
# Training
bst = gpb.train(params=params, train_set=data_train,  
                valid_sets=[data_train], valid_names=["train"],
                gp_model=gp_model, num_boost_round=100)
gp_model.summary() # Estimated covariance parameters

I think it's due to the training data, which has shape (179782, 82). How can I train on a large dataset then?

fabsig commented 1 year ago

Thanks for your interest in GPBoost!

For Gaussian processes, one needs to use an approximation for large data (not just in GPBoost). Otherwise, a dense n x n covariance matrix is constructed; with n ≈ 180,000, that matrix alone requires roughly 180,000² × 8 bytes ≈ 260 GB in double precision, which overflows memory and causes the std::bad_alloc. You can try

gp_model = gpb.GPModel(gp_coords=X_train_coord, gp_approx="vecchia") (recommended)

or

gp_model = gpb.GPModel(gp_coords=X_train_coord, gp_approx="tapering")

See also here for more details.

Depending on your computational resources and the data, a dataset of size 180K might already be at the limit with gp_approx="vecchia". You will have to try... We are currently developing an alternative approximation that runs faster, for which 180K should be no problem (it should be ready in a few months).
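For reference, a minimal sketch of the training code from the original post with the Vecchia approximation enabled (num_neighbors=20 is only an illustrative choice, not a tuned recommendation):

import gpboost as gpb

# Same coordinates and features as in the original post
gp_model = gpb.GPModel(gp_coords=X_train_coord, gp_approx="vecchia",
                       num_neighbors=20)  # illustrative value; tune as needed
data_train = gpb.Dataset(X_train[['scaled_surface', 'is_annex', 'is_premise', 'part_count']], y_train)
params = {'objective': 'regression_l2', 'learning_rate': 0.01,
          'max_depth': 3, 'min_data_in_leaf': 10,
          'num_leaves': 2**10, 'verbose': 1}
bst = gpb.train(params=params, train_set=data_train,
                valid_sets=[data_train], valid_names=["train"],
                gp_model=gp_model, num_boost_round=100)
gp_model.summary()  # Estimated covariance parameters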