dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0
26.3k stars 8.73k forks source link

Bootstrap Confidence Intervals for XGBoost regression (Python) #5475

Open Shafi2016 opened 4 years ago

Shafi2016 commented 4 years ago

I want to construct Bootstrap Confidence Intervals for XGBoost regression using python. I developed my case based on codes (https://machinelearningmastery.com/calculate-bootstrap-confidence-intervals-machine-learning-results-python/#comment-528118). Question: I am getting a one bin histogram. I get the single value for the score when we do n_iterations for the bootstrap. This is the problem and it is related to the way I am getting RMSE. Though I tried to find RMSE in different ways. yet, I could not solve the problem How can we solve it?

image

import numpy from pandas import read_csv from sklearn.datasets import load_boston from sklearn.utils import resample from matplotlib import pyplot from xgboost import XGBRegressor import pandas as pd import numpy as np from sklearn.metrics import mean_squared_error

load dataset

boston_dataset = load_boston()

df = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)

df['MEDV'] = boston_dataset.target values1 = df.values

configure bootstrap

n_iterations = 1000 n_size = int(len(df) * 0.50)

run bootstrap

stats = list()

# prepare train and test sets

for i in range(n_iterations):

prepare train and test sets

train = resample(values1, n_samples=n_size)
test = numpy.array([x for x in values1 if x.tolist() not in train.tolist()])

model = XGBRegressor() ## Final for the papers

X_train = train[:,:-1] y_train = train[:,-1] X_test = test[:,:-1] y_test = test[:,-1]

model.fit(X_train,y_train) predictions = model.predict(X_test)

make predictions

def rmse_calculator(predicted, actual):

assert len(predicted) == len(actual)
return np.sqrt(
            np.mean(
                np.power(predicted- actual, 2)))
score = rmse_calculator(y_test , predictions)

score = mean_squared_error(y_test, predictions) ** 0.5

yt = np.asarray(y_test) y_pred = np.asarray(predictions) score = np.sqrt(mean_squared_error(yt,y_pred)) print(score) stats.append(score)

plot scores

pyplot.hist(stats) pyplot.show()

confidence intervals

alpha = 0.95 p = ((1.0-alpha)/2.0) 100 lower = max(0.0, numpy.percentile(stats, p)) p = (alpha+((1.0-alpha)/2.0)) 100 upper = min(1.0, numpy.percentile(stats, p)) print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha100, lower100, upper*100))

jbrownlee commented 4 years ago

Try plotting the data to confirm there is a distribution. Perhaps there is not.

If there is, try changing the number of bins in the histogram plot.

Shafi2016 commented 4 years ago

Thanks a lot: Yes tried to change the number of bins but it did not work as:

sns.distplot(stats, hist=True, kde=False, bins=int(30/2), color = 'blue', hist_kws={'edgecolor':'black'})

I checked with XGBoost Classifier with the data (https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv). It works fine.

image

import numpy from pandas import read_csv from sklearn.utils import resample from xgboost import XGBClassifier from sklearn.metrics import accuracy_score from matplotlib import pyplot

load dataset

data = read_csv('pima-indians-diabetes.data.csv', header=None) values = data.values

configure bootstrap

n_iterations = 100 n_size = int(len(data) * 0.50)

run bootstrap

stats = list() for i in range(n_iterations):

prepare train and test sets

train = resample(values, n_samples=n_size)
test = numpy.array([x for x in values if x.tolist() not in train.tolist()])
# fit model
model =XGBClassifier()
model.fit(train[:,:-1], train[:,-1])
# evaluate model
predictions = model.predict(test[:,:-1])
score = accuracy_score(test[:,-1], predictions)
print(score)
stats.append(score)

plot scores

pyplot.hist(stats) pyplot.show()

confidence intervals

alpha = 0.95 p = ((1.0-alpha)/2.0) 100 lower = max(0.0, numpy.percentile(stats, p)) p = (alpha+((1.0-alpha)/2.0)) 100 upper = min(1.0, numpy.percentile(stats, p)) print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha100, lower100, upper*100))

Shafi2016 commented 4 years ago

I also ploted the histogram of Prediction (XGBoost regression) It seems fine:

image

yahmadyar95 commented 2 years ago

Hi, I have this error for classifier 'continuous is not supported' How can I solve it ?

jbrownlee commented 2 years ago

Hi Dmlc/Xgboost,

Thanks for asking.

I’m eager to help, but I just don’t have the capacity to debug code for you.

I am happy to make some suggestions:

Regards,

Jason Brownlee, Ph.D. Making Developers Awesome at Machine Learning

Do you need help with machine learning? Visit: MachineLearningMastery.com http://machinelearningmastery.com/

On Mon, May 16, 2022 at 5:41 AM yahmadyar95 @.***> wrote:

Hi, I have this error for classifier 'continuous is not supported' How can I solve it ?

— Reply to this email directly, view it on GitHub https://github.com/dmlc/xgboost/issues/5475#issuecomment-1127452785, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAADEWZDJTLGX3WOFWRDHW3VKIJ3FANCNFSM4L2P7RMQ . You are receiving this because you commented.Message ID: @.***>