StatMixedML opened 4 years ago
Thanks for reaching out. Your work has already generated a lot of interest on our side. ;-) I have a proof-of-concept implementation for multi-target training in https://github.com/dmlc/xgboost/pull/5460. The latest commit on that branch broke some functionality, so it can't be used yet.
Just out of personal interest, I also looked into NGBoost. In section 2.3 (ii) it mentions:

> Using a single tree per stage with multiple parameter outputs per leaf node would not be ideal since the splitting criteria based on the gradient of one parameter might be suboptimal with respect to the gradient of another parameter.
Based on some experiments with https://github.com/dmlc/xgboost/pull/5460, I agree with that. It might be due to the gradient, or it might be due to model capacity; I'm not sure yet.
Just in case I misunderstood something: if you are looking for a one-parameter-per-tree solution, the existing code base already supports it. See /demo/guide-python/custom_softmax.py for a Python example.
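For a rough idea of what such a one-output-per-tree custom objective looks like, here is a minimal sketch (not the demo file itself; the exact shape of `predt` differs across XGBoost versions, hence the defensive reshape, and the `2*p*(1-p)` Hessian is one common choice):

```python
import numpy as np
import xgboost as xgb

def softmax_obj(predt: np.ndarray, dtrain: xgb.DMatrix):
    # With num_class set, XGBoost grows one tree per class in each
    # boosting round, so there is one score per (row, class) pair.
    y = dtrain.get_label().astype(int)
    n = y.shape[0]
    predt = predt.reshape(n, -1)          # defensive: accepts 1-D or 2-D input
    e = np.exp(predt - predt.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)  # row-wise softmax probabilities
    grad = p.copy()
    grad[np.arange(n), y] -= 1.0          # d CE / d score = p - onehot(y)
    hess = np.maximum(2.0 * p * (1.0 - p), 1e-6)  # keep Hessian positive
    return grad.flatten(), hess.flatten()

# Hypothetical usage:
# bst = xgb.train({'num_class': 3, 'disable_default_eval_metric': 1},
#                 xgb.DMatrix(x, label=y), num_boost_round=10, obj=softmax_obj)
```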
@trivialfis Thank you so much for your comments and suggestions, very much appreciated! Let me go through the material you've provided. I'll keep you updated on the progress.
Hi all,
I found this article that may be helpful for this enhancement request for xgboost: https://towardsdatascience.com/regression-prediction-intervals-with-xgboost-428e0a018b
Ivan
Thanks for the reference, I will look into it.
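For context, the standard recipe for prediction intervals with plain XGBoost is a pinball (quantile) loss as a custom objective, training one booster per quantile. A minimal sketch, not necessarily the article's exact approach (the pinball loss has zero second derivative, so a constant surrogate Hessian is a common hack):

```python
import numpy as np
import xgboost as xgb

def make_quantile_obj(tau: float):
    """Pinball-loss objective for the tau-th quantile."""
    def quantile_obj(predt: np.ndarray, dtrain: xgb.DMatrix):
        err = dtrain.get_label() - predt
        grad = np.where(err > 0, -tau, 1.0 - tau)  # d loss / d predt
        hess = np.ones_like(predt)                 # surrogate (true Hessian is 0)
        return grad, hess
    return quantile_obj

# Hypothetical usage: a 90% interval from two separately trained boosters.
# lower = xgb.train(params, dtrain, num_boost_round=300, obj=make_quantile_obj(0.05))
# upper = xgb.train(params, dtrain, num_boost_round=300, obj=make_quantile_obj(0.95))
```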
Hi all,
I found the NGBoost approach (https://github.com/stanfordmlgroup/ngboost) for probabilistic regression. It seems that their code allows using any tree-based learner to perform the regression analysis.
I have Python 3.6.5 with XGBoost 1.1.0 and NGBoost 0.3.10, and I gave it a try with the following code:

```python
import numpy as np
import xgboost as xgb
import ngboost
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import multiprocessing
import matplotlib.pyplot as plt

if __name__ == '__main__':
    cpu_count = 2 if (multiprocessing.cpu_count() < 4) else (multiprocessing.cpu_count() - 2)

    x, y = load_boston(return_X_y=True)
    x = (x - np.mean(x, axis=0)) / np.std(x, axis=0)
    x_train, x_validation, y_train, y_validation = train_test_split(
        x, y, test_size=0.4, random_state=1969)

    learner = xgb.XGBRegressor(max_depth=6, n_estimators=300, verbosity=1,
                               objective='reg:squarederror', booster='gbtree',
                               tree_method='exact', n_jobs=cpu_count,
                               learning_rate=0.05, gamma=0.15, reg_alpha=0.20,
                               reg_lambda=0.50, random_state=1969)

    ngb = ngboost.NGBRegressor(Dist=ngboost.distns.Normal,
                               Score=ngboost.scores.CRPScore, Base=learner,
                               natural_gradient=True, n_estimators=1,
                               learning_rate=0.01, verbose=False,
                               random_state=1969)

    ngb.fit(x_train, y_train, X_val=x_validation, Y_val=y_validation)
    y_preds = ngb.predict(x_validation)

    fig, ax = plt.subplots(nrows=1, ncols=1)
    ax.plot(range(0, len(y_validation)), y_validation, '-k')
    ax.plot(range(0, len(y_validation)), y_preds, '--r')
    plt.show()
```
I got the following warning message:

```
c:\temp\python\python3.6.5\lib\site-packages\xgboost\core.py:445: UserWarning: Use subset (sliced data) of np.ndarray is not recommended because it will generate extra copies and increase memory consumption
  "memory consumption")
```
It seems that this warning played a role, because according to the generated plot the boosting stage didn't work well. It is as if the same tree was reused throughout the entire process.
If someone could look into how to get these two packages to work together, I believe we would have a pathway to probabilistic regression with XGBoost.
Many thanks,
Ivan
Hi all,
Previously, I reported that the boosting stage didn't work well; it was as if the same tree was reused throughout the entire process.
I believe I found a way to overcome this issue: I needed to set the number of estimators for xgboost as well as for ngboost. The code below shows this modification:
```python
import numpy as np
import ngboost
import xgboost
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import multiprocessing
import matplotlib.pyplot as plt

if __name__ == '__main__':
    cpu_count = 2 if (multiprocessing.cpu_count() < 4) else (multiprocessing.cpu_count() - 2)

    x, y = load_boston(return_X_y=True)
    mean_scaler = np.mean(x, axis=0)
    std_scaler = np.std(x, axis=0)
    x = (x - mean_scaler) / std_scaler
    x_train, x_validation, y_train, y_validation = train_test_split(
        x, y, test_size=0.4, random_state=1969)

    # using only ngboost
    ngb_1 = ngboost.NGBRegressor(Dist=ngboost.distns.Normal, Score=ngboost.scores.MLE,
                                 natural_gradient=True, n_estimators=300,
                                 learning_rate=0.01, verbose=False, random_state=1969)
    ngb_1.fit(x_train, y_train)
    y_preds_ngboost = ngb_1.predict(x_validation)

    # using xgboost with ngboost
    learner = xgboost.XGBRegressor(max_depth=6, n_estimators=300, verbosity=1,
                                   objective='reg:squarederror', booster='gbtree',
                                   tree_method='exact', n_jobs=cpu_count,
                                   learning_rate=0.05, gamma=0.15, reg_alpha=0.20,
                                   reg_lambda=0.50, random_state=1969)
    ngb_2 = ngboost.NGBRegressor(Dist=ngboost.distns.Normal, Score=ngboost.scores.MLE,
                                 Base=learner, natural_gradient=True, n_estimators=300,
                                 learning_rate=0.01, verbose=False, random_state=1969)
    ngb_2.fit(x_train, y_train)
    y_preds_hyboost = ngb_2.predict(x_validation)

    fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(10, 5))
    ax[0].plot(range(0, len(x_validation)), y_validation, '-k', label='validation')
    ax[0].plot(range(0, len(x_validation)), y_preds_ngboost, '--r', label='ngboost')
    ax[0].set_title("NGBOOST: validation & prediction")
    ax[0].legend()
    ax[1].plot(range(0, len(x_validation)), y_validation, '-k', label='validation')
    ax[1].plot(range(0, len(x_validation)), y_preds_hyboost, '--r', label='hyboost')
    ax[1].set_title("HYBOOST: validation & prediction")
    ax[1].legend()
    ax[2].plot(range(0, len(x_validation)), y_preds_ngboost, '-k', label='ngboost')
    ax[2].plot(range(0, len(x_validation)), y_preds_hyboost, '--r', label='hyboost')
    ax[2].set_title("NGBOOST - HYBOOST: prediction")
    ax[2].legend()
    plt.show()
```
Unfortunately, I still get the same warning message:

```
Warning (from warnings module):
  File "C:\Temp\Python\Python3.6.5\lib\site-packages\xgboost\core.py", line 445
    "memory consumption")
UserWarning: Use subset (sliced data) of np.ndarray is not recommended because it will generate extra copies and increase memory consumption
```
Does this warning influence the quality of the model learned by xgboost?
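For what it's worth, the warning text itself only mentions extra memory copies from non-contiguous array slices, not a modelling problem. A minimal sketch of one possible workaround, assuming the offending slice is created in the calling code rather than inside ngboost itself (it reuses `ngb_2`, `x_train`, and `y_train` from the snippet above):

```python
import numpy as np

# np.ascontiguousarray returns a C-contiguous copy up front, so XGBoost
# never sees a sliced (non-contiguous) view of the original array.
x_train_c = np.ascontiguousarray(x_train)
y_train_c = np.ascontiguousarray(y_train)
ngb_2.fit(x_train_c, y_train_c)
```

If the slicing happens inside ngboost itself, the warning presumably cannot be silenced from the calling code.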
Kind regards,
Ivan
Dear community,
I am currently working on a probabilistic extension of XGBoost, called XGBoostLSS, that models all parameters of a distribution. This makes it possible to create probabilistic forecasts from which prediction intervals and quantiles of interest can be derived.
The problem is that XGBoost doesn't permit optimizing over several parameters simultaneously. Assume we have a Normal distribution y ~ N(µ, σ). So far, my approach is a two-step procedure: I first optimize µ with σ fixed, then optimize σ with µ fixed, and iterate between these two steps.
Since this is inefficient, are there any ways of optimizing both µ and σ simultaneously using a custom loss function?
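One possible direction, sketched below under explicit assumptions: (ab)use the multi-class machinery by setting num_class to the number of distribution parameters, so each boosting round grows one tree for µ and one for log(σ), and return the per-parameter gradients of the Gaussian negative log-likelihood from a custom objective. The num_class trick and all names here are illustrative assumptions, not a supported XGBoost feature:

```python
import numpy as np
import xgboost as xgb

def gaussian_nll_obj(predt: np.ndarray, dtrain: xgb.DMatrix):
    """NLL = log(sigma) + (y - mu)^2 / (2 sigma^2) + const,
    parameterized with sigma = exp(s) so both outputs are unconstrained."""
    y = dtrain.get_label()
    n = y.shape[0]
    predt = predt.reshape(n, 2)                 # column 0: mu, column 1: s = log(sigma)
    mu, s = predt[:, 0], predt[:, 1]
    sigma2 = np.exp(2.0 * s)
    grad = np.empty((n, 2))
    hess = np.empty((n, 2))
    grad[:, 0] = -(y - mu) / sigma2             # d NLL / d mu
    hess[:, 0] = 1.0 / sigma2
    grad[:, 1] = 1.0 - (y - mu) ** 2 / sigma2   # d NLL / d s
    hess[:, 1] = np.maximum(2.0 * (y - mu) ** 2 / sigma2, 1e-6)  # keep positive
    return grad.flatten(), hess.flatten()

# Hypothetical usage -- raw scores must be read back with output_margin=True,
# since the multi-class prediction path would otherwise post-process them:
# dtrain = xgb.DMatrix(x_train, label=y_train)
# bst = xgb.train({'num_class': 2, 'disable_default_eval_metric': 1},
#                 dtrain, num_boost_round=300, obj=gaussian_nll_obj)
# mu_s = bst.predict(xgb.DMatrix(x_validation), output_margin=True)
```

Note that each parameter still gets its own tree per round here, so the concern quoted from the NGBoost paper (one tree with multiple outputs per leaf) does not arise; the cost is simply twice as many trees per boosting iteration.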