fabsig / GPBoost

Combining tree-boosting with Gaussian process and mixed effects models

SHAP values not matching with model output #127

Closed: raheems closed this issue 7 months ago

raheems commented 8 months ago

This is not a new issue for LightGBM models, but I am wondering: is there a way I could get the SHAP values on the same scale as the model's output?

fabsig commented 8 months ago

GPBoost just uses the same approach as LightGBM. Is it possible with LightGBM?

raheems commented 8 months ago

Looks like it has to do with how LightGBM produces SHAP output for random forest regression models. I found two related issues, #669 and #6217.

Someone suggested that multiplying the contribution values by the number of trees gives values on the scale of the output variable.

Where do we get the number of trees in GPBoost?

I tried setting n_estimators in the param list, but it doesn't accept n_estimators.
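For reference, here is a minimal sketch of how the tree count might be read directly from a trained booster (bst refers to the booster trained in the next comment). This assumes GPBoost's Booster keeps LightGBM's num_trees() and current_iteration() methods, which is an assumption on my part.

# Assumption: GPBoost's Booster mirrors LightGBM's Booster API
n_trees = bst.num_trees()           # total number of trees in the fitted ensemble
n_rounds = bst.current_iteration()  # boosting rounds actually performed
print(n_trees, n_rounds)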

raheems commented 8 months ago

Here is a reproducible example:


import numpy as np
import pandas as pd
import gpboost as gpb
from sklearn import datasets

# Simulate data
ntrain = 5000 # number of samples for training
n = 2 * ntrain # combined number of training and test data
m = 500  # number of categories / levels for grouping variable
sigma2_1 = 1  # random effect variance
sigma2 = 1 ** 2  # error variance

# Simulate non-linear mean function
np.random.seed(1)
X, F = datasets.make_friedman3(n_samples=n)
X = pd.DataFrame(X,columns=['variable_1','variable_2','variable_3','variable_4'])
F = F * 10**0.5 # with this choice, the fixed-effects regression function has the same variance as the random effects

# Simulate random effects
group_train = np.arange(ntrain)  # grouping variable
for i in range(m):
    group_train[int(i * ntrain / m):int((i + 1) * ntrain / m)] = i
group_test = np.arange(ntrain) # grouping variable for test data. Some existing and some new groups
m_test = 2 * m

for i in range(m_test):
    group_test[int(i * ntrain / m_test):int((i + 1) * ntrain / m_test)] = i
group = np.concatenate((group_train,group_test))
b = np.sqrt(sigma2_1) * np.random.normal(size=m_test)  # simulate random effects
Zb = b[group]

# Put everything together
xi = np.sqrt(sigma2) * np.random.normal(size=n)  # simulate error term
y = F + Zb + xi # observed data

# split train and test data
y_train = y[0:ntrain]
y_test = y[ntrain:n]
X_train = X.iloc[0:ntrain, :]
X_test = X.iloc[ntrain:n, :]

Summary of the output variable, y:

pd.DataFrame(y_train).describe()

count    5000.000000
mean        4.124520
std         1.732864
min        -2.448804
25%         3.063291
50%         4.251012
75%         5.259857
max         9.623841
# Define and train GPModel
gp_model = gpb.GPModel(group_data=group_train)

# create dataset for gpb.train function
data_train = gpb.Dataset(X_train, y_train)

# specify tree-boosting parameters as a dict
params = { 'objective': 'regression_l2', 'learning_rate': 0.1,
    'max_depth': 6, 'min_data_in_leaf': 5, 'verbose': 0 }

# train model
bst = gpb.train(params=params, train_set=data_train, gp_model=gp_model, num_boost_round=31)

If I use pred_latent=True in bst.predict(), then I get values on a scale similar to the output variable:

pred = bst.predict(data=X_test, group_data_pred=group_test, pred_latent=True)
pred

Out[]: {'fixed_effect': array([4.49024473, 4.61357281, 4.7404804 , ..., 4.69610829, 4.37190392,
        4.75426338]),
 'random_effect_mean': array([0.14302215, 0.14302215, 0.14302215, ..., 0.        , 0.        ,
        0.        ]),
 'random_effect_cov': None,
 'response_mean': None,
 'response_var': None}
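Side note: since the simulated model is additive (y = F(X) + Zb + error), the latent pieces returned above combine by simple addition. A sketch, assuming a Gaussian likelihood so that the latent mean equals the response-scale mean:

# Combine the latent components returned by pred_latent=True.
# For a Gaussian likelihood the response-scale mean prediction is just their sum.
pred_response = pred['fixed_effect'] + pred['random_effect_mean']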

Is it possible to get a similar output when extracting the SHAP values as follows?


import shap
import numpy as np

shap_values = shap.TreeExplainer(bst).shap_values(X_test)

raheems commented 8 months ago

I see that num_boost_round=31 is the number of trees I need.
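As a hedged end-to-end check of that point (assuming GPBoost boosters stay compatible with shap.TreeExplainer, as used above): the SHAP values plus the explainer's expected value should approximately reconstruct the latent fixed-effect predictions, i.e. the tree-ensemble part of the model rather than the full response that also contains the random effects.

import numpy as np
import shap

explainer = shap.TreeExplainer(bst)
shap_values = explainer.shap_values(X_test)   # shape: (n_test, n_features)

# The SHAP decomposition should sum back to the fixed-effect (tree) prediction;
# expected_value is a scalar for a single-output regression model
reconstructed = shap_values.sum(axis=1) + explainer.expected_value

pred = bst.predict(data=X_test, group_data_pred=group_test, pred_latent=True)
print(np.allclose(reconstructed, pred['fixed_effect'], atol=1e-4))

If the check passes, the SHAP values are already on the scale of the latent fixed-effect predictions; any remaining gap to the observed y comes from the random effects and the error term.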