dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Improve XGBoost quantile predictions #10136

Closed Manjubn777 closed 5 months ago

Manjubn777 commented 5 months ago

Hi team,

I'm currently using XGBoost to predict the 0.1, 0.5, and 0.9 quantiles. I use reg:quantileerror as the objective, pass the corresponding quantile_alpha value, and train a separate model for each quantile. I've observed instances where the 0.9 quantile predictions are lower than the actual values, leading to overlaps; in some cases the 0.1 forecasts are higher than the actuals while the 0.9 forecasts are lower. Could you please provide some guidance on improving the accuracy and generating narrower forecasts?

XGBoost version: 2.0.3
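
For illustration, here is a simplified sketch of the setup, with synthetic data rather than my actual pipeline (the variable names are only for this example):

import numpy as np
import xgboost as xgb

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(-10, 10, size=(1000, 3))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=1000)
dtrain = xgb.DMatrix(X, label=y)

# One model per quantile, each with its own quantile_alpha.
models = {
    alpha: xgb.train(
        {"objective": "reg:quantileerror", "quantile_alpha": alpha},
        dtrain,
        num_boost_round=100,
    )
    for alpha in (0.1, 0.5, 0.9)
}

preds = {alpha: booster.predict(dtrain) for alpha, booster in models.items()}
# Quantile crossing: rows where the 0.9 prediction falls below the 0.1 prediction.
print((preds[0.9] < preds[0.1]).mean())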

trivialfis commented 5 months ago

It's a limitation of the current implementation of quantile regression that quantile crossing can occur. So far we don't have a solution. If anyone has suggestions for algorithms, please share.

Manjubn777 commented 5 months ago

Okay, are you planning to add additional loss functions for quantile forecasting in the near future?

trivialfis commented 5 months ago

I can't make any promises due to other work at hand, but in the meantime, please look into XGBoostLSS, which embeds distributional assumptions into the model.

RektPunk commented 1 month ago

This seems to be a similar issue to https://github.com/dmlc/xgboost/issues/9848. Here is my solution: to address the quantile crossing problem, I implemented a custom loss together with monotone constraints so that the quantiles satisfy a non-crossing condition, inspired by the references below.

Roughly, the idea is as follows. The input data is stacked once per input alpha (quantile level), and each alpha is appended as an extra feature column, following Cannon's approach. The gradient of the composite quantile loss is then computed on the stacked data, with the Hessian set to 1. Finally, by placing an increasing monotone constraint on the alpha column, the non-crossing property is preserved on any new data, even when the prediction alphas differ from the training alphas.
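
For intuition, here is a tiny standalone sketch of just the stacking step, with toy values (the _tau column name matches the code below):

import pandas as pd

x = pd.DataFrame({"feature": [1.0, 2.0, 3.0]})
alphas = [0.1, 0.9]
# Repeat the rows once per alpha and record the alpha in a "_tau" column.
stacked = pd.concat([x.assign(_tau=alpha) for alpha in alphas], axis=0)
print(stacked)
#    feature  _tau
# 0      1.0   0.1
# 1      2.0   0.1
# 2      3.0   0.1
# 0      1.0   0.9
# 1      2.0   0.9
# 2      3.0   0.9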

The code looks like this:

from typing import List, Union, Dict, Any, Tuple
from functools import partial
from itertools import repeat, chain

import numpy as np
import xgboost as xgb
import pandas as pd

def _grad_rho(u: np.ndarray, alpha: float) -> np.ndarray:
    # Gradient of the check (pinball) loss rho_alpha(u) with respect to the
    # prediction, where u = y_true - y_pred.
    return -(alpha - (u < 0).astype(float))

def check_loss_grad_hess(
    y_pred: np.ndarray, dtrain: xgb.DMatrix, alphas: List[float]
) -> Tuple[np.ndarray, np.ndarray]:
    # Custom objective: labels and predictions are stacked per alpha, so
    # reshape them to (n_alphas, n_samples) and compute the check-loss
    # gradient for each alpha separately.
    _len_alpha = len(alphas)
    _y_train = dtrain.get_label()
    _y_pred_reshaped = y_pred.reshape(_len_alpha, -1)
    _y_train_reshaped = _y_train.reshape(_len_alpha, -1)

    grads = []
    for alpha_inx in range(_len_alpha):
        _err_for_alpha = _y_train_reshaped[alpha_inx] - _y_pred_reshaped[alpha_inx]
        grad = _grad_rho(_err_for_alpha, alphas[alpha_inx])
        grads.append(grad)

    grad = np.concatenate(grads)
    # The check loss is piecewise linear, so a constant Hessian of 1 is used.
    hess = np.ones(_y_train.shape)

    return grad, hess

def _alpha_validate(
    alphas: Union[List[float], float],
) -> List[float]:
    if isinstance(alphas, float):
        alphas = [alphas]
    return alphas

def _prepare_x(
    x: Union[pd.DataFrame, pd.Series, np.ndarray],
    alphas: List[float],
) -> pd.DataFrame:
    if isinstance(x, np.ndarray) or isinstance(x, pd.Series):
        x = pd.DataFrame(x)
    assert "_tau" not in x.columns, "Column name '_tau' is not allowed."
    _alpha_repeat_count_list = [list(repeat(alpha, len(x))) for alpha in alphas]
    _alpha_repeat_list = list(chain.from_iterable(_alpha_repeat_count_list))
    _repeated_x = pd.concat([x] * len(alphas), axis=0)

    _repeated_x = _repeated_x.assign(
        _tau=_alpha_repeat_list,
    )
    return _repeated_x

def _prepare_train(
    x: Union[pd.DataFrame, pd.Series, np.ndarray],
    y: Union[pd.Series, np.ndarray],
    alphas: List[float],
) -> Tuple[pd.DataFrame, np.ndarray]:
    _train_df = _prepare_x(x, alphas)
    _repeated_y = np.concatenate(list(repeat(y, len(alphas))))
    return (_train_df, _repeated_y)

class MonotonicQuantileRegressor:
    def __init__(
        self,
        x: Union[pd.DataFrame, pd.Series, np.ndarray],
        y: Union[pd.Series, np.ndarray],
        alphas: Union[List[float], float],
    ):
        alphas = _alpha_validate(alphas)
        self.x_train, self.y_train = _prepare_train(x, y, alphas)
        self.dataset = xgb.DMatrix(data=self.x_train, label=self.y_train)
        self.obj = partial(check_loss_grad_hess, alphas=alphas)

    def train(self, params: Dict[str, Any]) -> xgb.Booster:
        self._params = params.copy()
        # Mark the appended "_tau" (alpha) column as monotonically increasing
        # so that predicted quantiles cannot cross; other features keep the
        # constraints supplied by the user (or 0, i.e. unconstrained).
        if "monotone_constraints" in self._params:
            _monotone_constraints = list(self._params["monotone_constraints"])
            _monotone_constraints.append(1)
            self._params["monotone_constraints"] = tuple(_monotone_constraints)
        else:
            self._params.update(
                {
                    "monotone_constraints": tuple(
                        1 if col == "_tau" else 0 for col in self.x_train.columns
                    )
                }
            )
        self.model = xgb.train(
            dtrain=self.dataset,
            verbose_eval=False,
            params=self._params,
            obj=self.obj,
        )
        return self.model

    def predict(
        self,
        x: Union[pd.DataFrame, pd.Series, np.ndarray],
        alphas: Union[List[float], float],
    ) -> np.ndarray:
        alphas = _alpha_validate(alphas)
        _x = _prepare_x(x, alphas)
        _x = xgb.DMatrix(_x)
        _pred = self.model.predict(_x)
        _pred = _pred.reshape(len(alphas), len(x))
        return _pred
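
A quick note on the design: XGBoost's monotone_constraints parameter takes one value per feature (1 for increasing, 0 for unconstrained, -1 for decreasing), so the tuple built in train() simply marks the appended _tau column as increasing and leaves all other features unconstrained. With a single original feature plus the _tau column, the resolved parameters end up looking like this:

params = {
    "learning_rate": 0.65,
    "max_depth": 10,
    "monotone_constraints": (0, 1),  # original feature unconstrained, _tau increasing
}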

As a test, I created a simple example.

sample_size = 500
params = {
    "learning_rate": 0.65,
    "max_depth": 10,
}

alphas = [0.3, 0.4, 0.5, 0.6, 0.7]
x = np.linspace(-10, 10, sample_size)
x_test = np.linspace(-10, 10, sample_size)
y_noise = np.sin(x) + np.random.uniform(-0.4, 0.4, sample_size)
y_test = np.sin(x_test) + np.random.uniform(-0.4, 0.4, sample_size)

monotonic_quantile_regressor = MonotonicQuantileRegressor(x=x, y=y_noise, alphas=alphas)
model = monotonic_quantile_regressor.train(params=params)
preds = monotonic_quantile_regressor.predict(x=x_test, alphas=alphas)
preds_df = pd.DataFrame(preds).T
(preds_df.diff(axis=1) < 0).sum(axis=1).sum(axis=0)  # 0 means the non-crossing condition holds

When visualized, the result was as shown in the figure below.

(Screenshot, 2024-07-06: predicted quantile curves on the test data.)
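
Since the quantile level is just another input column, the fitted model can also be queried at alphas that were not used during training; as noted above, the monotone constraint on the _tau column means these predictions should also satisfy the non-crossing condition. Continuing the example above:

new_preds = monotonic_quantile_regressor.predict(x=x_test, alphas=[0.35, 0.65])
new_preds.shape  # (2, 500): one row of predictions per requested quantile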

The latest code can be found here. A quantile estimator based on LightGBM is also included, and the package can be installed via pip install quantile-tree. I am very curious how others have approached this problem and would love to hear any new ideas, insights, or feedback.

References

Non-crossing nonlinear regression quantiles by monotone composite quantile regression neural network, with application to rainfall extremes: https://link.springer.com/article/10.1007/s00477-018-1573-6
Learning Multiple Quantiles With Neural Networks: https://www.tandfonline.com/doi/full/10.1080/10618600.2021.1909601