dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Bootstrap Confidence Intervals for XGBoost regression (Python) #5475

Open Shafi2016 opened 4 years ago

Shafi2016 commented 4 years ago

I want to construct bootstrap confidence intervals for XGBoost regression using Python. I based my code on https://machinelearningmastery.com/calculate-bootstrap-confidence-intervals-machine-learning-results-python/#comment-528118. Question: I am getting a one-bin histogram, because I get a single value for the score across the bootstrap iterations. The problem seems to be related to the way I am computing the RMSE. I have tried computing the RMSE in several different ways, but I could not solve the problem. How can I solve it?

[image: histogram of the bootstrap RMSE scores, showing a single bin]

```python
import numpy
from sklearn.datasets import load_boston
from sklearn.utils import resample
from matplotlib import pyplot
from xgboost import XGBRegressor
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error

# load dataset
boston_dataset = load_boston()
df = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
df['MEDV'] = boston_dataset.target
values1 = df.values

# configure bootstrap
n_iterations = 1000
n_size = int(len(df) * 0.50)


def rmse_calculator(predicted, actual):
    assert len(predicted) == len(actual)
    return np.sqrt(np.mean(np.power(predicted - actual, 2)))


# run bootstrap
stats = list()
for i in range(n_iterations):
    # prepare train and test sets
    train = resample(values1, n_samples=n_size)
    test = numpy.array([x for x in values1 if x.tolist() not in train.tolist()])
    X_train = train[:, :-1]
    y_train = train[:, -1]
    X_test = test[:, :-1]
    y_test = test[:, -1]

    # fit model and make predictions
    model = XGBRegressor()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    # evaluate model: RMSE, computed in three equivalent ways
    score = rmse_calculator(y_test, predictions)
    score = mean_squared_error(y_test, predictions) ** 0.5
    yt = np.asarray(y_test)
    y_pred = np.asarray(predictions)
    score = np.sqrt(mean_squared_error(yt, y_pred))
    print(score)
    stats.append(score)

# plot scores
pyplot.hist(stats)
pyplot.show()

# confidence intervals
alpha = 0.95
p = ((1.0 - alpha) / 2.0) * 100
lower = max(0.0, numpy.percentile(stats, p))
p = (alpha + ((1.0 - alpha) / 2.0)) * 100
upper = min(1.0, numpy.percentile(stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha * 100, lower * 100, upper * 100))
```
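(Side note: the `max(0.0, ...)` / `min(1.0, ...)` clamps and the percent formatting in the last lines were carried over from the accuracy tutorial; RMSE is not bounded by 1, so the clamp can silently cap the upper bound. A minimal sketch of a percentile interval for an unbounded metric, reusing the `stats` list from the script above:)

```python
import numpy

# percentile bootstrap interval for an unbounded metric such as RMSE
alpha = 0.95
lower = numpy.percentile(stats, ((1.0 - alpha) / 2.0) * 100)
upper = numpy.percentile(stats, (alpha + (1.0 - alpha) / 2.0) * 100)
print('%.0f%% confidence interval: %.3f to %.3f' % (alpha * 100, lower, upper))
```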

jbrownlee commented 4 years ago

Try plotting the data to confirm there is a distribution. Perhaps there is not.

If there is, try changing the number of bins in the histogram plot.
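For instance (a minimal sketch, assuming `stats` is the list of scores collected in the script above):

```python
import numpy as np
from matplotlib import pyplot

# confirm there is any spread in the scores before worrying about bins
print('min %.4f max %.4f std %.4f' % (np.min(stats), np.max(stats), np.std(stats)))

# if there is spread, an explicit bin count can reveal it
pyplot.hist(stats, bins=30)
pyplot.show()
```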

Shafi2016 commented 4 years ago

Thanks a lot. Yes, I tried changing the number of bins, but it did not work:

```python
import seaborn as sns

sns.distplot(stats, hist=True, kde=False, bins=int(30 / 2),
             color='blue', hist_kws={'edgecolor': 'black'})
```
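(If every bootstrap score is identical, no bin setting will produce more than one bar; a quick check, assuming the `stats` list from the script above:)

```python
import numpy as np

# count how many distinct score values the bootstrap actually produced
print(len(set(np.round(stats, 6))), 'distinct values out of', len(stats))
```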

I checked with an XGBoost classifier on the Pima Indians diabetes data (https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv). It works fine:

[image: histogram of bootstrap accuracy scores, showing a spread of values]

```python
import numpy
from pandas import read_csv
from sklearn.utils import resample
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from matplotlib import pyplot

# load dataset
data = read_csv('pima-indians-diabetes.data.csv', header=None)
values = data.values

# configure bootstrap
n_iterations = 100
n_size = int(len(data) * 0.50)

# run bootstrap
stats = list()
for i in range(n_iterations):
    # prepare train and test sets
    train = resample(values, n_samples=n_size)
    test = numpy.array([x for x in values if x.tolist() not in train.tolist()])
    # fit model
    model = XGBClassifier()
    model.fit(train[:, :-1], train[:, -1])
    # evaluate model
    predictions = model.predict(test[:, :-1])
    score = accuracy_score(test[:, -1], predictions)
    print(score)
    stats.append(score)

# plot scores
pyplot.hist(stats)
pyplot.show()

# confidence intervals
alpha = 0.95
p = ((1.0 - alpha) / 2.0) * 100
lower = max(0.0, numpy.percentile(stats, p))
p = (alpha + ((1.0 - alpha) / 2.0)) * 100
upper = min(1.0, numpy.percentile(stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha * 100, lower * 100, upper * 100))
```

Shafi2016 commented 4 years ago

I also plotted a histogram of the predictions (XGBoost regression), and it seems fine:

[image: histogram of the regression predictions]

yahmadyar95 commented 2 years ago

Hi, I get the error 'continuous is not supported' for the classifier. How can I solve it?
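(For context, this ValueError usually comes from scikit-learn's classification utilities when the target values are continuous floats rather than discrete class labels. A minimal sketch of the failure and one possible fix, with made-up targets for illustration:)

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0.2, 0.7, 0.4])  # continuous targets, not class labels
y_pred = np.array([0.1, 0.8, 0.5])
# accuracy_score(y_true, y_pred)    # raises ValueError: continuous is not supported

# one possible fix: convert the targets to discrete labels first
y_true_cls = (y_true > 0.5).astype(int)
y_pred_cls = (y_pred > 0.5).astype(int)
print(accuracy_score(y_true_cls, y_pred_cls))  # 1.0
```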

jbrownlee commented 2 years ago

Hi Dmlc/Xgboost,

Thanks for asking.

I’m eager to help, but I just don’t have the capacity to debug code for you.

I am happy to make some suggestions.

Regards,

Jason Brownlee, Ph.D.
Making Developers Awesome at Machine Learning
MachineLearningMastery.com
