dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Multi-parameter optimization with custom loss function for probabilistic forecasting #5859

Open StatMixedML opened 4 years ago

StatMixedML commented 4 years ago

Dear community,

I am currently working on a probabilistic extension of XGBoost called XGBoostLSS that models all parameters of a distribution. This makes it possible to create probabilistic forecasts from which prediction intervals and quantiles of interest can be derived.
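
For illustration, a minimal sketch (with made-up parameter values, assuming y ~ N(µ, σ)) of how intervals and quantiles follow once both parameters have been predicted per observation:

    import numpy as np
    from scipy.stats import norm

    # hypothetical per-observation predictions of the distribution parameters
    mu_hat = np.array([10.0, 12.5, 9.3])
    sigma_hat = np.array([1.2, 0.8, 2.1])

    # 90% prediction interval and median for each observation
    lower = norm.ppf(0.05, loc=mu_hat, scale=sigma_hat)
    upper = norm.ppf(0.95, loc=mu_hat, scale=sigma_hat)
    median = norm.ppf(0.50, loc=mu_hat, scale=sigma_hat)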

The problem is that XGBoost doesn't permit optimizing over several parameters. Assume we have a Normal distribution y ~ N(µ, σ). So far, my approach is a two-step procedure: I first optimize µ with σ fixed, then optimize σ with µ fixed, and then iterate between these two steps.

Since this is inefficient, is there any way to optimize both µ and σ simultaneously using a custom loss function?
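
For concreteness, a rough, untested sketch of what one step of the two-step procedure looks like as a custom objective (here the µ step of the Normal negative log-likelihood, with σ held fixed; `sigma_fixed` and the parameter values are placeholders, not the actual XGBoostLSS code):

    import numpy as np
    import xgboost as xgb

    def make_mu_objective(sigma_fixed):
        """Custom objective for the Normal negative log-likelihood, sigma held fixed."""
        def mu_objective(preds, dtrain):
            y = dtrain.get_label()
            grad = (preds - y) / sigma_fixed ** 2          # d NLL / d mu
            hess = np.ones_like(preds) / sigma_fixed ** 2  # d^2 NLL / d mu^2
            return grad, hess
        return mu_objective

    # usage, with sigma_fixed coming from the previous sigma step:
    # dtrain = xgb.DMatrix(X, label=y)
    # bst_mu = xgb.train({"tree_method": "hist"}, dtrain, num_boost_round=100,
    #                    obj=make_mu_objective(sigma_fixed))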

trivialfis commented 4 years ago

Thanks for reaching out. Your work has already generated a lot of interest on our side. ;-) I have a proof-of-concept implementation for multi-target training in https://github.com/dmlc/xgboost/pull/5460 . The latest commit on that branch broke some functionality, so it can't be used yet.

Just out of personal interest, I also looked into ngboost; in Section 2.3 (ii) it mentions:

Using a single tree per stage with multiple parameter outputs per leaf node would not be ideal since the splitting criteria based on the gradient of one parameter might be suboptimal with respect to the gradient of another parameter

Based on some experiments with https://github.com/dmlc/xgboost/pull/5460, I agree with that. It might be due to the gradient, or it might be due to model capacity; I'm not sure yet.

trivialfis commented 4 years ago

Just in case I misunderstood something: if you are looking for a 1-parameter-per-tree solution, the existing code base already supports it. See /demo/guide-python/custom_softmax.py for an example in Python.
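
For anyone reading along, a rough, untested sketch of how that demo's multi-class pattern could be bent towards two distribution parameters instead of two classes (slot 0 → µ, slot 1 → log σ). The exact prediction layout passed to a custom objective depends on the XGBoost version, so treat this as an illustration only, not the demo itself:

    import numpy as np
    import xgboost as xgb

    def normal_nll_obj(predt, dtrain):
        y = dtrain.get_label()
        predt = predt.reshape(len(y), 2)            # slot 0: mu, slot 1: log(sigma)
        mu, log_sigma = predt[:, 0], predt[:, 1]
        sigma2 = np.exp(2.0 * log_sigma)

        grad = np.zeros_like(predt)
        hess = np.zeros_like(predt)
        grad[:, 0] = (mu - y) / sigma2              # d NLL / d mu
        hess[:, 0] = 1.0 / sigma2
        grad[:, 1] = 1.0 - (y - mu) ** 2 / sigma2   # d NLL / d log(sigma)
        hess[:, 1] = 2.0                            # expected (Fisher) information, kept positive
        return grad.reshape(-1), hess.reshape(-1)

    # dtrain = xgb.DMatrix(X, label=y)
    # bst = xgb.train({"num_class": 2, "disable_default_eval_metric": 1},
    #                 dtrain, num_boost_round=100, obj=normal_nll_obj)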

StatMixedML commented 4 years ago

@trivialfis Thank you so much for your comments and suggestions, very much appreciated! Let me go through the material you've provided. I'll keep you updated on the progress.

ivan-marroquin commented 3 years ago

Hi all,

I found this article that I think may be helpful for this enhancement request for xgboost: https://towardsdatascience.com/regression-prediction-intervals-with-xgboost-428e0a018b
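
In case it helps, a minimal, untested sketch of one common route to prediction intervals with XGBoost, quantile regression via a custom pinball loss (not necessarily the approach taken in the article):

    import numpy as np
    import xgboost as xgb

    def make_pinball_objective(alpha):
        def pinball(preds, dtrain):
            y = dtrain.get_label()
            diff = y - preds
            # subgradient of the pinball loss; the true second derivative is zero,
            # so a constant positive hessian is used as a common workaround
            grad = np.where(diff > 0, -alpha, 1.0 - alpha)
            hess = np.ones_like(preds)
            return grad, hess
        return pinball

    # one booster per quantile, e.g. 5% and 95% for a 90% interval
    # dtrain = xgb.DMatrix(X, label=y)
    # bst_lo = xgb.train({"tree_method": "hist"}, dtrain, 200, obj=make_pinball_objective(0.05))
    # bst_hi = xgb.train({"tree_method": "hist"}, dtrain, 200, obj=make_pinball_objective(0.95))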

Ivan

trivialfis commented 3 years ago

Thanks for the reference, I will look into it.

ivan-marroquin commented 3 years ago

Hi all,

I found the NGBoost approach (https://github.com/stanfordmlgroup/ngboost) for probabilistic regression. It seems that their code allows using any tree-based learner as the base learner for the regression.

I have Python 3.6.5 with XGBoost 1.1.0 and NGBoost 0.3.10, and I gave it a try with the following code:

    import numpy as np
    import xgboost as xgb
    import ngboost
    from sklearn.datasets import load_boston
    from sklearn.model_selection import train_test_split
    import multiprocessing
    import matplotlib.pyplot as plt

    if __name__ == '__main__':
        cpu_count = 2 if (multiprocessing.cpu_count() < 4) else (multiprocessing.cpu_count() - 2)

        x, y = load_boston(return_X_y=True)
        x = (x - np.mean(x, axis=0)) / np.std(x, axis=0)
        x_train, x_validation, y_train, y_validation = train_test_split(x, y, test_size=0.4, random_state=1969)

        learner = xgb.XGBRegressor(max_depth=6, n_estimators=300, verbosity=1, objective='reg:squarederror',
                                   booster='gbtree', tree_method='exact', n_jobs=cpu_count, learning_rate=0.05,
                                   gamma=0.15, reg_alpha=0.20, reg_lambda=0.50, random_state=1969)

        ngb = ngboost.NGBRegressor(Dist=ngboost.distns.Normal, Score=ngboost.scores.CRPScore, Base=learner,
                                   natural_gradient=True, n_estimators=1, learning_rate=0.01,
                                   verbose=False, random_state=1969)

        ngb.fit(x_train, y_train, X_val=x_validation, Y_val=y_validation)
        y_preds = ngb.predict(x_validation)

        fig, ax = plt.subplots(nrows=1, ncols=1)
        ax.plot(range(0, len(y_validation)), y_validation, '-k')
        ax.plot(range(0, len(y_validation)), y_preds, '--r')

I got the following warning message:

    c:\temp\python\python3.6.5\lib\site-packages\xgboost\core.py:445: UserWarning: Use subset (sliced data) of np.ndarray is not recommended because it will generate extra copies and increase memory consumption

It seems that this warning played a role, because according to the generated plot the boosting stage didn't work well. It is as if the same tree was reused throughout the entire process.

If someone could look into how to get these two packages to work together, then I believe we have a pathway to running probabilistic regression with XGBoost.

Many thanks,

Ivan

ivan-marroquin commented 3 years ago

Hi all,

Previously, I reported that the boosting stage didn't work well, as if the same tree was reused throughout the entire process.

I believe I found a way to overcome this issue: I needed to set the number of estimators for xgboost as well as for ngboost. The code below shows this modification:

    import numpy as np
    import ngboost
    import xgboost
    from sklearn.datasets import load_boston
    from sklearn.model_selection import train_test_split
    import multiprocessing
    import matplotlib.pyplot as plt

    if __name__ == '__main__':
        cpu_count = 2 if (multiprocessing.cpu_count() < 4) else (multiprocessing.cpu_count() - 2)

        x, y = load_boston(return_X_y=True)
        mean_scaler = np.mean(x, axis=0)
        std_scaler = np.std(x, axis=0)
        x = (x - mean_scaler) / std_scaler
        x_train, x_validation, y_train, y_validation = train_test_split(x, y, test_size=0.4, random_state=1969)

        # using only ngboost
        ngb_1 = ngboost.NGBRegressor(Dist=ngboost.distns.Normal, Score=ngboost.scores.MLE,
                                     natural_gradient=True, n_estimators=300, learning_rate=0.01,
                                     verbose=False, random_state=1969)
        ngb_1.fit(x_train, y_train)
        y_preds_ngboost = ngb_1.predict(x_validation)

        # using xgboost with ngboost
        learner = xgboost.XGBRegressor(max_depth=6, n_estimators=300, verbosity=1, objective='reg:squarederror',
                                       booster='gbtree', tree_method='exact', n_jobs=cpu_count, learning_rate=0.05,
                                       gamma=0.15, reg_alpha=0.20, reg_lambda=0.50, random_state=1969)
        ngb_2 = ngboost.NGBRegressor(Dist=ngboost.distns.Normal, Score=ngboost.scores.MLE, Base=learner,
                                     natural_gradient=True, n_estimators=300, learning_rate=0.01,
                                     verbose=False, random_state=1969)
        ngb_2.fit(x_train, y_train)
        y_preds_hyboost = ngb_2.predict(x_validation)

        fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(10, 5))

        ax[0].plot(range(0, len(x_validation)), y_validation, '-k', label='validation')
        ax[0].plot(range(0, len(x_validation)), y_preds_ngboost, '--r', label='ngboost')
        ax[0].set_title("NGBOOST: validation & prediction")
        ax[0].legend()

        ax[1].plot(range(0, len(x_validation)), y_validation, '-k', label='validation')
        ax[1].plot(range(0, len(x_validation)), y_preds_hyboost, '--r', label='hyboost')
        ax[1].set_title("HYBOOST: validation & prediction")
        ax[1].legend()

        ax[2].plot(range(0, len(x_validation)), y_preds_ngboost, '-k', label='ngboost')
        ax[2].plot(range(0, len(x_validation)), y_preds_hyboost, '--r', label='hyboost')
        ax[2].set_title("NGBOOST - HYBOOST: prediction")
        ax[2].legend()

        plt.show()

Unfortunately, I still get the same warning message:

    Warning (from warnings module):
      File "C:\Temp\Python\Python3.6.5\lib\site-packages\xgboost\core.py", line 445
        "memory consumption")
    UserWarning: Use subset (sliced data) of np.ndarray is not recommended because it will generate extra copies and increase memory consumption

Does this issue influence the quality of the model learned by xgboost?

Kind regards,

Ivan