fabsig / GPBoost

Combining tree-boosting with Gaussian process and mixed effects models

GPBoostError: Check failed: #66

Closed: mariosgeo closed this issue 2 years ago

mariosgeo commented 2 years ago

Hello, I want to fit a combined tree-boosting and Gaussian process model on the attached data (keep.csv).

import gpboost as gpb
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import pandas as pd

likelihood = "bernoulli_probit"

keep = pd.read_csv('keep.csv')
coords_train = np.c_[keep['half'].astype(float), keep['x'].astype(float), keep['y'].astype(float)]

params = {'learning_rate': 0.1, 'objective': likelihood,
          'verbose': 0, 'monotone_constraints': [1, 0]}
num_boost_round = 25

params['objective'] = 'binary'

# make a small subset to check
X_train = coords_train[:10000, :]
y_train = keep['classification_for_VTK_number'].values
y_train[y_train != 3] = 0
y_train[y_train == 3] = 1
y_train = y_train[:10000]

gp_model = gpb.GPModel(gp_coords=coords_train[:10000, :], cov_function="exponential",
                       likelihood=likelihood)
# Create dataset for gpb.train
data_train = gpb.Dataset(X_train, y_train)
bst = gpb.train(params=params, train_set=data_train, gp_model=gp_model,
                num_boost_round=num_boost_round)
gp_model.summary() # Estimated random effects model

I get the following error:

[GPBoost] [Fatal] Check failed: (static_cast(traindata->num_total_features())) == (config->monotone_constraints.size()) at C:\Users\whsigris\Dropbox\HSLU\Projects\MixedBoost\GPBoost\python-package\compile\src\LightGBM\boosting\gbdt.cpp, line 55 .

2) A bonus question: why do X_train and coords_train have to be different? I understand that in your example you wanted to remove one corner, but for other types of data, can't I use the same array for both?

fabsig commented 2 years ago

As the error message says, the number of monotone constraints does not match the number of features in X_train. Dropping monotone_constraints from params will fix the issue, i.e., replace the params with

params = {'learning_rate': 0.1, 'objective': likelihood, 'verbose': 0}

Note: monotone_constraints was included in the demo examples. I realize that this can create confusion and have removed it from the demo.
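If you do want a monotonicity constraint, the check in the error message requires one constraint entry per feature column of X_train. A minimal, illustrative sketch for your three columns (the values [1, 0, 0], i.e. increasing in the first feature and unconstrained in the others, are just placeholders):

params = {'learning_rate': 0.1, 'objective': likelihood, 'verbose': 0,
          'monotone_constraints': [1, 0, 0]}  # one entry per column of X_train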

Note: using Gaussian processes on large data requires some approximation. Your current code will not finish in a reasonable amount of time. The go-to option for large data in GPBoost is a Vecchia approximation:

gp_model = gpb.GPModel(gp_coords=coords_train[:10000,:], cov_function="exponential",
                       likelihood=likelihood, vecchia_approx=True, num_neighbors=15)

Unfortunately, in the current implementation, the Vecchia approximation does not work well for non-Gaussian data, including binary data. An alternative is to use a compactly supported covariance function, such as a tapered or a Wendland covariance function. The latter can be done as follows:

gp_model = gpb.GPModel(gp_coords=coords_train[:10000,:], cov_function="wendland",
                       likelihood=likelihood, cov_fct_taper_range=10)

You might need to try several values for cov_fct_taper_range or, better, tune it; a sketch of one way to compare candidate values follows below. We are currently working on a better large-data solution for non-Gaussian data, but at this point I cannot give any indication of when it will be released.
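For instance, here is a minimal sketch of comparing a few candidate taper ranges on a held-out split. The candidate values are placeholders, scikit-learn is used only for the split and the log loss, and it is assumed that bst.predict with pred_latent=False returns a dict whose 'response_mean' entry holds the predicted probabilities (this may differ between GPBoost versions):

from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# hold out a quarter of the 10,000-point subset for validation
train_idx, val_idx = train_test_split(np.arange(10000), test_size=0.25, random_state=0)
X_tr, X_val = X_train[train_idx], X_train[val_idx]
y_tr, y_val = y_train[train_idx], y_train[val_idx]

for taper_range in [5, 10, 20, 50]:  # placeholder values; adjust to the scale of your coordinates
    gp_model = gpb.GPModel(gp_coords=X_tr, cov_function="wendland",
                           likelihood=likelihood, cov_fct_taper_range=taper_range)
    data_tr = gpb.Dataset(X_tr, y_tr)
    bst = gpb.train(params=params, train_set=data_tr, gp_model=gp_model,
                    num_boost_round=num_boost_round)
    # predicted response probabilities at the validation coordinates
    pred = bst.predict(data=X_val, gp_coords_pred=X_val, pred_latent=False)
    print(f"cov_fct_taper_range={taper_range}: validation log loss = "
          f"{log_loss(y_val, pred['response_mean']):.4f}")

The taper range giving the lowest validation log loss is a reasonable choice.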

Concerning your second question: X_train and coords_train do not need to be different; they can be the same. The demo contains an example of spatial data where coords_train holds the spatial coordinates and X_train holds other features / covariates, which is a typical situation in spatial statistics. In general, however, you can pass the same array for both; see the sketch below.
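For example, a minimal sketch (reusing the variable names from your snippet, and keeping in mind the large-data caveat above) in which the same array provides both the GP coordinates and the tree-boosting features:

# the same columns serve as GP coordinates and as boosting features
gp_model = gpb.GPModel(gp_coords=coords_train[:10000, :], cov_function="exponential",
                       likelihood=likelihood)
data_train = gpb.Dataset(coords_train[:10000, :], y_train)
bst = gpb.train(params=params, train_set=data_train, gp_model=gp_model,
                num_boost_round=num_boost_round)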