fabsig / GPBoost

Combining tree-boosting with Gaussian process and mixed effects models

GPBoostError: Check failed: #66

Closed: mariosgeo closed this issue 2 years ago

mariosgeo commented 2 years ago

Hello, I want to fit a combined tree-boosting and Gaussian process model on the attached data (keep.csv).

import gpboost as gpb
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import pandas as pd

likelihood = "bernoulli_probit"

keep = pd.read_csv('keep.csv')
coords_train = np.c_[keep['half'].astype(float), keep['x'].astype(float), keep['y'].astype(float)]

params = {'learning_rate': 0.1, 'objective': likelihood,
          'verbose': 0, 'monotone_constraints': [1, 0]}
num_boost_round = 25

params['objective'] = 'binary'

# make a small subset to check
X_train = coords_train[:10000, :]
y_train = keep['classification_for_VTK_number'].values
y_train[y_train != 3] = 0
y_train[y_train == 3] = 1
y_train = y_train[:10000]

gp_model = gpb.GPModel(gp_coords=coords_train[:10000, :], cov_function="exponential",
                       likelihood=likelihood)
# Create dataset for gpb.train
data_train = gpb.Dataset(X_train, y_train)
bst = gpb.train(params=params, train_set=data_train, gp_model=gp_model,
                num_boost_round=num_boost_round)
gp_model.summary() # Estimated random effects model

I get the following error:

[GPBoost] [Fatal] Check failed: (static_cast(traindata->num_total_features())) == (config->monotone_constraints.size()) at C:\Users\whsigris\Dropbox\HSLU\Projects\MixedBoost\GPBoost\python-package\compile\src\LightGBM\boosting\gbdt.cpp, line 55 .

2) A bonus question: why do X_train and coords_train have to be different? I understand that in your example you wanted to remove one corner, but for other types of data, can't I use the same array for both?

fabsig commented 2 years ago

As the error message says, the number of monotone constraints does not match the number of features in X_train. Dropping monotone_constraints from params will fix the issue, i.e., replace the params with

params = {'learning_rate': 0.1, 'objective': likelihood, 'verbose': 0}

Note: monotone_constraints was included in the demo examples. I realize that this can create confusion and have removed it from the demo.
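If you do want a monotonicity constraint, the check in the error message requires one constraint entry per feature column of X_train. A minimal, illustrative sketch for your three columns (the values [1, 0, 0], i.e. increasing in the first feature and unconstrained in the others, are just placeholders):

params = {'learning_rate': 0.1, 'objective': likelihood, 'verbose': 0,
          'monotone_constraints': [1, 0, 0]}  # one entry per column of X_train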

Note: using Gaussian processes on large data requires some approximation. Your current code will not finish in a reasonable amount of time. The go-to option for large data in GPBoost is a Vecchia approximation:

gp_model = gpb.GPModel(gp_coords=coords_train[:10000,:], cov_function="exponential",
                       likelihood=likelihood, vecchia_approx=True, num_neighbors=15)

Unfortunately, in the current implementation, the Vecchia approximation does not work well for non-Gaussian data, including binary data. An alternative is to use a compactly supported covariance function, such as a tapered or a Wendland covariance function. The latter can be done as follows:

gp_model = gpb.GPModel(gp_coords=coords_train[:10000,:], cov_function="wendland",
                       likelihood=likelihood, cov_fct_taper_range=10)

You might need to try several values for cov_fct_taper_range or, better, tune it; a sketch of one way to compare candidate values follows below. We are currently working on a better large-data solution for non-Gaussian data, but at this point I cannot give any indication of when it will be released.
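For instance, here is a minimal sketch of comparing a few candidate taper ranges on a held-out split. The candidate values are placeholders, scikit-learn is used only for the split and the log loss, and it is assumed that bst.predict with pred_latent=False returns a dict whose 'response_mean' entry holds the predicted probabilities (this may differ between GPBoost versions):

from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# hold out a quarter of the 10,000-point subset for validation
train_idx, val_idx = train_test_split(np.arange(10000), test_size=0.25, random_state=0)
X_tr, X_val = X_train[train_idx], X_train[val_idx]
y_tr, y_val = y_train[train_idx], y_train[val_idx]

for taper_range in [5, 10, 20, 50]:  # placeholder values; adjust to the scale of your coordinates
    gp_model = gpb.GPModel(gp_coords=X_tr, cov_function="wendland",
                           likelihood=likelihood, cov_fct_taper_range=taper_range)
    data_tr = gpb.Dataset(X_tr, y_tr)
    bst = gpb.train(params=params, train_set=data_tr, gp_model=gp_model,
                    num_boost_round=num_boost_round)
    # predicted response probabilities at the validation coordinates
    pred = bst.predict(data=X_val, gp_coords_pred=X_val, pred_latent=False)
    print(f"cov_fct_taper_range={taper_range}: validation log loss = "
          f"{log_loss(y_val, pred['response_mean']):.4f}")

The taper range giving the lowest validation log loss is a reasonable choice.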

Concerning your second question: X_train and coords_train do not need to be different; they can be the same. The demo contains an example of spatial data where coords_train holds the spatial coordinates and X_train holds other features / covariates, which is a typical situation in spatial statistics. In general, however, you can pass the same array for both; see the sketch below.
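For example, a minimal sketch (reusing the variable names from your snippet, and keeping in mind the large-data caveat above) in which the same array provides both the GP coordinates and the tree-boosting features:

# the same columns serve as GP coordinates and as boosting features
gp_model = gpb.GPModel(gp_coords=coords_train[:10000, :], cov_function="exponential",
                       likelihood=likelihood)
data_train = gpb.Dataset(coords_train[:10000, :], y_train)
bst = gpb.train(params=params, train_set=data_train, gp_model=gp_model,
                num_boost_round=num_boost_round)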