GPBoost just uses the same approach as LightGBM. Is it possible with LightGBM?
Looks like it has to do with how LightGBM outputs Random Forest regression models. I found two related issues #669 and #6217.
Someone suggested that multiplying the contribution values by the number of trees would give values on the scale of the output variable.
Where do we get the number of trees in gpboost?
I tried setting n_estimators in the params list, but it doesn't accept n_estimators.
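For reference, a minimal sketch of the naming I suspect applies, assuming GPBoost follows LightGBM's convention where the tree count is the num_boost_round argument of gpb.train (or the num_iterations key in params) rather than n_estimators:
# assumption: LightGBM-style parameter names carry over to gpboost
params = {'objective': 'regression_l2', 'num_iterations': 31}  # instead of 'n_estimators'
# equivalently: bst = gpb.train(params, data_train, num_boost_round=31)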
Here is a reproducible example:
import numpy as np
import pandas as pd
import gpboost as gpb
from sklearn import datasets

# Simulate data
ntrain = 5000  # number of samples for training
n = 2 * ntrain  # combined number of training and test data
m = 500  # number of categories / levels for grouping variable
sigma2_1 = 1  # random effect variance
sigma2 = 1 ** 2  # error variance
# Simulate non-linear mean function
np.random.seed(1)
X, F = datasets.make_friedman3(n_samples=n)
X = pd.DataFrame(X, columns=['variable_1', 'variable_2', 'variable_3', 'variable_4'])
F = F * 10**0.5  # with this choice, the fixed-effects regression function has the same variance as the random effects
# Simulate random effects
group_train = np.arange(ntrain)  # grouping variable
for i in range(m):
    group_train[int(i * ntrain / m):int((i + 1) * ntrain / m)] = i
group_test = np.arange(ntrain)  # grouping variable for test data. Some existing and some new groups
m_test = 2 * m
for i in range(m_test):
    group_test[int(i * ntrain / m_test):int((i + 1) * ntrain / m_test)] = i
group = np.concatenate((group_train, group_test))
b = np.sqrt(sigma2_1) * np.random.normal(size=m_test)  # simulate random effects
Zb = b[group]
# Put everything together
xi = np.sqrt(sigma2) * np.random.normal(size=n)  # simulate error term
y = F + Zb + xi  # observed data
# split train and test data
y_train = y[0:ntrain]
y_test = y[ntrain:n]
X_train = X.iloc[0:ntrain, :]
X_test = X.iloc[ntrain:n, :]
Summary of the output variable, y:
pd.DataFrame(y_train).describe()
count 5000.000000
mean 4.124520
std 1.732864
min -2.448804
25% 3.063291
50% 4.251012
75% 5.259857
max 9.623841
# Define and train GPModel
gp_model = gpb.GPModel(group_data=group_train)
# create dataset for gpb.train function
data_train = gpb.Dataset(X_train, y_train)
# specify tree-boosting parameters as a dict
params = {'objective': 'regression_l2',
          'learning_rate': 0.1,
          'max_depth': 6,
          'min_data_in_leaf': 5,
          'verbose': 0}
# train model
bst = gpb.train(params=params, train_set=data_train, gp_model=gp_model, num_boost_round=31)
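As a sanity check on the tree count, a minimal sketch assuming GPBoost's Booster keeps LightGBM's accessors (num_trees() and current_iteration(); I have not verified that these exist in gpboost):
# assumption: these LightGBM Booster accessors are also available on gpboost's Booster
print(bst.current_iteration())  # boosting rounds performed (31 here)
print(bst.num_trees())          # total number of trees stored in the model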
If I use pred_latent=True in bst.predict(), then we get values on a scale similar to the output variable:
pred = bst.predict(data=X_test, group_data_pred=group_test, pred_latent=True)
pred
Out[]: {'fixed_effect': array([4.49024473, 4.61357281, 4.7404804 , ..., 4.69610829, 4.37190392,
4.75426338]),
'random_effect_mean': array([0.14302215, 0.14302215, 0.14302215, ..., 0. , 0. ,
0. ]),
'random_effect_cov': None,
'response_mean': None,
'response_var': None}
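For comparison, the default pred_latent=False should combine the tree ensemble and the random effects into a single response-scale prediction (a minimal sketch, assuming 'response_mean' is then the field that gets filled, as the keys above suggest):
# sketch: response-scale prediction (assumption: with pred_latent=False the
# combined prediction is returned in 'response_mean' instead of 'fixed_effect')
pred_resp = bst.predict(data=X_test, group_data_pred=group_test, pred_latent=False)
print(pred_resp['response_mean'][:5])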
Is it possible to get similar output when extracting the SHAP values as follows?
import shap
import numpy as np
shap_values = shap.TreeExplainer(bst).shap_values(X_test)
I see that num_boost_round=31 is the number of trees I need.
This is not a new issue for LightGBM models, but I am wondering: is there a way I could get the SHAP values on the same scale as the model's output?
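For concreteness, here is a minimal sketch of the kind of check I have in mind: compare the sum of the SHAP values plus the explainer's expected value against the latent fixed-effect predictions, and optionally apply the rescaling by the number of trees suggested in the linked issues (num_trees_used is just the num_boost_round value from above; I am not sure the rescaling is the right fix for boosted models):
explainer = shap.TreeExplainer(bst)
shap_values = explainer.shap_values(X_test)
# the contributions plus the base value should approximate the latent fixed effect
recon = shap_values.sum(axis=1) + explainer.expected_value
print(np.corrcoef(recon, pred['fixed_effect'])[0, 1])
# suggested workaround from the linked issues (assumption: num_boost_round is the
# tree count; unclear whether this rescaling applies outside random-forest mode)
num_trees_used = 31
shap_values_rescaled = shap_values * num_trees_used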