fabsig / GPBoost

Combining tree-boosting with Gaussian process and mixed effects models

Vecchia approximation with panel data #132

Closed imadoualid closed 4 months ago

imadoualid commented 4 months ago

Hello, thank you for the gpboost package. I have a question about the Vecchia approximation: I'm using a function to find the best parameters via grid search, and I'm also using the Vecchia approximation:

```python
import gpboost as gpb

def search_best_params(X_train, X_train_coord, y_train):

    gp_model = gpb.GPModel(gp_coords=X_train_coord, gp_approx="vecchia")
    data_train = gpb.Dataset(X_train, y_train)

    param_grid = {
        #'n_estimators': [50, 100, 250, 500, 1000],
        'min_child_samples': [1, 5, 10, 15, 25, 50, 100],
        'max_depth': [1, 3, 5, 10],
        'learning_rate': [1, 0.1, 0.01, 0.001, 0.0001],
        'num_leaves': [2**10, 2**17],
        'lambda_l2': [0.001, 0.1, 0, 1, 10],
        'first_metric_only': [True]
        #'boosting': ['gbdt']
    }

    # Other parameters not contained in the grid of tuning parameters
    params = {'objective': 'regression_l2',
              'n_jobs': 10}

    opt_params = gpb.grid_search_tune_parameters(
        param_grid=param_grid, params=params,
        nfold=3,
        gp_model=gp_model, train_set=data_train,
        verbose_eval=3, num_try_random=1,
        num_boost_round=1000, early_stopping_rounds=20,
        seed=42, metric=['rmse', 'test_neg_log_likelihood'])

    print("Best number of iterations: " + str(opt_params['best_iter']))
    print("Best score: " + str(opt_params['best_score']))
    print("Best parameters: " + str(opt_params['best_params']))

    return opt_params['best_params'], opt_params['best_iter']
```
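As an aside, the "1400 parameter combinations" reported in the log below is simply the product of the grid sizes above (7 × 4 × 5 × 2 × 5 × 1), and `num_try_random=1` means a single combination is sampled from them. A quick sanity check in plain Python (this mirrors the grid above; it is not GPBoost internals):

```python
import itertools
import random

# Mirror of the param_grid above, for counting purposes only
param_grid = {
    'min_child_samples': [1, 5, 10, 15, 25, 50, 100],
    'max_depth': [1, 3, 5, 10],
    'learning_rate': [1, 0.1, 0.01, 0.001, 0.0001],
    'num_leaves': [2**10, 2**17],
    'lambda_l2': [0.001, 0.1, 0, 1, 10],
    'first_metric_only': [True],
}

keys = list(param_grid)
combos = list(itertools.product(*(param_grid[k] for k in keys)))
print(len(combos))  # → 1400, matching "1400 parameter combinations" in the log

# num_try_random=1 corresponds to trying one randomly sampled combination
random.seed(42)
trial = dict(zip(keys, random.choice(combos)))
print(trial)
```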

However, depending on the dataset I'm using, I get some warnings:

```
grid searching ......
[GPBoost] [Info] Starting nearest neighbor search for Vecchia approximation
[GPBoost] [Info] Nearest neighbors for Vecchia approximation found
Starting random grid search with 1 trials out of 1400 parameter combinations 
Trying parameter combination 1 of 1: {'min_child_samples': 1, 'max_depth': 10, 'learning_rate': 0.001, 'num_leaves': 1024, 'lambda_l2': 0.0, 'first_metric_only': True}
[GPBoost] [Warning] Find whitespaces in feature_names, replace with underlines
[GPBoost] [Info] Starting nearest neighbor search for Vecchia approximation
[GPBoost] [Info] Nearest neighbors for Vecchia approximation found
[GPBoost] [Warning] Find whitespaces in feature_names, replace with underlines
[GPBoost] [Info] Total Bins 359
[GPBoost] [Info] Number of data points in the train set: 30168, number of used features: 79
[GPBoost] [Warning] Find whitespaces in feature_names, replace with underlines
[GPBoost] [Info] Starting nearest neighbor search for Vecchia approximation
[GPBoost] [Info] Nearest neighbors for Vecchia approximation found
[GPBoost] [Warning] Find whitespaces in feature_names, replace with underlines
[GPBoost] [Info] Total Bins 353
[GPBoost] [Info] Number of data points in the train set: 30168, number of used features: 79
[GPBoost] [Warning] Find whitespaces in feature_names, replace with underlines
[GPBoost] [Info] Starting nearest neighbor search for Vecchia approximation
[GPBoost] [Info] Nearest neighbors for Vecchia approximation found
[GPBoost] [Warning] Find whitespaces in feature_names, replace with underlines
[GPBoost] [Info] Total Bins 352
[GPBoost] [Info] Number of data points in the train set: 30168, number of used features: 79
[GPBoost] [Warning] Find whitespaces in feature_names, replace with underlines
[GPBoost] [Info] [GPBoost with gaussian likelihood]: initscore=11.966429
[GPBoost] [Info] Start training from score 11.966429
[GPBoost] [Info] [GPBoost with gaussian likelihood]: initscore=11.963212
[GPBoost] [Info] Start training from score 11.963212
[GPBoost] [Info] [GPBoost with gaussian likelihood]: initscore=11.965751
[GPBoost] [Info] Start training from score 11.965751
[GPBoost] [Warning] Calculation of (only) predictive variances is currently not optimized for the Vecchia approximation, and this might takes a lot of time and/or memory.
[GPBoost] [Warning] Calculation of (only) predictive variances is currently not optimized for the Vecchia approximation, and this might takes a lot of time and/or memory.
[GPBoost] [Warning] Calculation of (only) predictive variances is currently not optimized for the Vecchia approximation, and this might takes a lot of time and/or memory.
[1] cv_agg's rmse: 0.379734 cv_agg's test_neg_log_likelihood: 0.421238
[GPBoost] [Warning] Calculation of (only) predictive variances is currently not optimized for the Vecchia approximation, and this might takes a lot of time and/or memory.
[GPBoost] [Warning] Calculation of (only) predictive variances is currently not optimized for the Vecchia approximation, and this might takes a lot of time and/or memory.
[GPBoost] [Warning] Calculation of (only) predictive variances is currently not optimized for the Vecchia approximation, and this might takes a lot of time and/or memory.
[2] cv_agg's rmse: 0.378342 cv_agg's test_neg_log_likelihood: 0.417429
[GPBoost] [Warning] Calculation of (only) predictive variances is currently not optimized for the Vecchia approximation, and this might takes a lot of time and/or memory.
[GPBoost] [Warning] Calculation of (only) predictive variances is currently not optimized for the Vecchia approximation, and this might takes a lot of time and/or memory.
[GPBoost] [Warning] Calculation of (only) predictive variances is currently not optimized for the Vecchia approximation, and this might takes a lot of time and/or memory.
[3] cv_agg's rmse: 0.376945 cv_agg's test_neg_log_likelihood: 0.413648
[GPBoost] [Warning] Calculation of (only) predictive variances is currently not optimized for the Vecchia approximation, and this might takes a lot of time and/or memory.
[GPBoost] [Warning] Calculation of (only) predictive variances is currently not optimized for the Vecchia approximation, and this might takes a lot of time and/or memory.
[GPBoost] [Warning] Calculation of (only) predictive variances is currently not optimized for the Vecchia approximation, and this might takes a lot of time and/or memory.
[4] cv_agg's rmse: 0.37555  cv_agg's test_neg_log_likelihood: 0.409857
[GPBoost] [Warning] Calculation of (only) predictive variances is currently not optimized for the Vecchia approximation, and this might takes a lot of time and/or memory.
[GPBoost] [Warning] Calculation of (only) predictive variances is currently not optimized for the Vecchia approximation, and this might takes a lot of time and/or memory.
[GPBoost] [Warning] Calculation of (only) predictive variances is currently not optimized for the Vecchia approximation, and this might takes a lot of time and/or memory.
```

I was wondering why I sometimes get this warning and sometimes not? If I change the dataset, I'm not getting it :+1:

```
[GPBoost] [Info] Starting nearest neighbor search for Vecchia approximation
[GPBoost] [Info] Nearest neighbors for Vecchia approximation found
Starting random grid search with 1 trials out of 1400 parameter combinations 
Trying parameter combination 1 of 1: {'min_child_samples': 1, 'max_depth': 10, 'learning_rate': 0.001, 'num_leaves': 1024, 'lambda_l2': 0.0, 'first_metric_only': True}
[GPBoost] [Warning] Find whitespaces in feature_names, replace with underlines
[GPBoost] [Info] Starting nearest neighbor search for Vecchia approximation
[GPBoost] [Info] Nearest neighbors for Vecchia approximation found
[GPBoost] [Warning] Find whitespaces in feature_names, replace with underlines
[GPBoost] [Info] Total Bins 254
[GPBoost] [Info] Number of data points in the train set: 4432, number of used features: 67
[GPBoost] [Warning] Find whitespaces in feature_names, replace with underlines
[GPBoost] [Info] Starting nearest neighbor search for Vecchia approximation
[GPBoost] [Info] Nearest neighbors for Vecchia approximation found
[GPBoost] [Warning] Find whitespaces in feature_names, replace with underlines
[GPBoost] [Info] Total Bins 259
[GPBoost] [Info] Number of data points in the train set: 4432, number of used features: 67
[GPBoost] [Warning] Find whitespaces in feature_names, replace with underlines
[GPBoost] [Info] Starting nearest neighbor search for Vecchia approximation
[GPBoost] [Info] Nearest neighbors for Vecchia approximation found
[GPBoost] [Warning] Find whitespaces in feature_names, replace with underlines
[GPBoost] [Info] Total Bins 253
[GPBoost] [Info] Number of data points in the train set: 4432, number of used features: 65
[GPBoost] [Warning] Find whitespaces in feature_names, replace with underlines
[GPBoost] [Info] [GPBoost with gaussian likelihood]: initscore=12.147779
[GPBoost] [Info] Start training from score 12.147779
[GPBoost] [Info] [GPBoost with gaussian likelihood]: initscore=12.144678
[GPBoost] [Info] Start training from score 12.144678
[GPBoost] [Info] [GPBoost with gaussian likelihood]: initscore=12.144654
[GPBoost] [Info] Start training from score 12.144654
[1] cv_agg's rmse: 0.419133 cv_agg's test_neg_log_likelihood: 0.540981
[2] cv_agg's rmse: 0.417655 cv_agg's test_neg_log_likelihood: 0.537433
[3] cv_agg's rmse: 0.416166 cv_agg's test_neg_log_likelihood: 0.533858
[4] cv_agg's rmse: 0.414672 cv_agg's test_neg_log_likelihood: 0.530257
[5] cv_agg's rmse: 0.413194 cv_agg's test_neg_log_likelihood: 0.526683
[6] cv_agg's rmse: 0.411691 cv_agg's test_neg_log_likelihood: 0.523039
[7] cv_agg's rmse: 0.410204 cv_agg's test_neg_log_likelihood: 0.51943
[8] cv_agg's rmse: 0.408715 cv_agg's test_neg_log_likelihood: 0.515777
[9] cv_agg's rmse: 0.407217 cv_agg's test_neg_log_likelihood: 0.512149
```
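Incidentally, the `Find whitespaces in feature_names` warning in both logs can be avoided by renaming columns before constructing the `Dataset`, e.g. `X_train.columns = [c.replace(' ', '_') for c in X_train.columns]` for a pandas DataFrame. A minimal sketch with plain Python (the feature names here are made up for illustration):

```python
# Hypothetical feature names containing spaces (illustration only)
feature_names = ["living area", "year built", "lot size"]

# Replace whitespace with underscores, mirroring what GPBoost does internally
clean_names = [name.replace(" ", "_") for name in feature_names]
print(clean_names)  # → ['living_area', 'year_built', 'lot_size']
```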
fabsig commented 4 months ago

Thanks a lot for using GPBoost and for reporting this issue!

This is a legacy warning and can be ignored; I will disable it in the next release. The warning is triggered when the test / validation sample size exceeds a certain number. Previously, the calculation of predictive variances (which are needed for the metric test_neg_log_likelihood) was slow and memory-intensive for a large number of prediction points, but this issue was solved some time ago (let me know if you still have problems ...)
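For context on why this metric needs predictive variances at all: the Gaussian negative log-likelihood of a held-out point is evaluated under the predictive distribution, so it depends on the predictive variance and not just the predictive mean (unlike RMSE). A minimal illustration in plain Python (this is not GPBoost code, just the formula):

```python
import math

def gaussian_neg_log_likelihood(y, pred_mean, pred_var):
    """Average Gaussian negative log-likelihood of observations y under
    per-point predictive means and variances."""
    total = 0.0
    for yi, mu, var in zip(y, pred_mean, pred_var):
        # Each term uses the predictive variance, which is why this metric
        # triggers the (formerly expensive) variance computations.
        total += 0.5 * (math.log(2 * math.pi * var) + (yi - mu) ** 2 / var)
    return total / len(y)

# Example: standard-normal predictive distribution, observation at the mean
print(gaussian_neg_log_likelihood([0.0], [0.0], [1.0]))  # → 0.5*ln(2*pi) ≈ 0.9189
```

If only RMSE is needed, dropping `'test_neg_log_likelihood'` from the `metric` list avoids the variance computation entirely.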