dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

(scikit-learn api) (feature request) XGBRegressor with early stopping, dynamic eval_set #7782

Open iuiu34 opened 2 years ago

iuiu34 commented 2 years ago

XGBRegressor has the parameter eval_set, where you pass an evaluation set that the regressor uses to perform early stopping. Since this eval_set is fixed, when you do cross-validation with n folds, the same eval_set is used in all n folds.

It would be cool to have the option for eval_set to be built dynamically at fit time, as a split from that particular fold's training data. Find an example below. If you like the idea, I'm happy to do a proper PR.

from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor


class XGBRegressorWithEarlyStop(XGBRegressor):
    """Wrapper of XGBRegressor with early stopping."""

    def __init__(self, objective="reg:squarederror", early_stopping_rounds=5,
                 test_size=0.1, eval_metric='rmse', shuffle=False, **kwargs):
        """Init as super."""
        self.early_stopping_rounds = early_stopping_rounds
        self.test_size = test_size
        self.eval_metric = eval_metric
        self.shuffle = shuffle
        super().__init__(objective=objective, **kwargs)

    def fit(self, x, y, verbose=False, sample_weight=None):
        """Fit the regressor, splitting off an eval_set from the given data."""
        if sample_weight is not None:
            x_train, x_val, y_train, y_val, w_train, w_val = train_test_split(
                x, y, sample_weight,
                test_size=self.test_size, shuffle=self.shuffle)
        else:
            x_train, x_val, y_train, y_val = train_test_split(
                x, y,
                test_size=self.test_size, shuffle=self.shuffle)
            w_train, w_val = None, None
        # The held-out split becomes the eval_set driving early stopping.
        super().fit(x_train, y_train,
                    early_stopping_rounds=self.early_stopping_rounds,
                    eval_metric=self.eval_metric,
                    eval_set=[(x_val, y_val)],
                    verbose=verbose,
                    sample_weight=w_train,
                    sample_weight_eval_set=(
                        [w_val] if w_val is not None else None))
        return self

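For instance (a usage sketch of my own, with toy data from sklearn and assuming the class above), each CV fold would then carve out its own eval_set:

from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
# Every fold's fit() splits a fresh eval_set off that fold's training data.
scores = cross_val_score(XGBRegressorWithEarlyStop(), X, y, cv=5)
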
trivialfis commented 2 years ago

Thank you for the offer. I understand that sklearn does early stopping this way, but we have to consider optional information other than sample_weight, as well as GPU data structures and distributed training. Embedding data partitioning into our code might not be a flexible design.

trivialfis commented 2 years ago

Having said that, I would welcome any discussion around the feature and design.

iuiu34 commented 2 years ago

Assuming we're talking about XGBRegressor (XGBClassifier is equivalent), I see 3 options:

Option 1 - new class. Define a new class XGBRegressorWithEarlyStop as above. This adds the feature without altering the native class, but then you have 2 very similar classes, more documentation to maintain, etc.

Option 2 - implicit params. We don't add any new param to XGBRegressor, but if we call XGBRegressor(..., early_stopping_rounds=5, eval_set=None), then instead of raising the error eval_set is not defined, the class creates the eval_set dynamically at fit time.

Option 3 - explicit params. We add a new param to XGBRegressor:

eval_set_dynamic: bool = True - eval_set is created dynamically at fit time.

If we call XGBRegressor(..., early_stopping_rounds=5, eval_set=None, eval_set_dynamic=False), then it still raises the error eval_set is not defined. If we call XGBRegressor(..., early_stopping_rounds=5, eval_set=[(x_val, y_val)], eval_set_dynamic=True), then it raises the error you can not pass a static eval_set when eval_set_dynamic is True.
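
To make Option 3 concrete, here is a rough sketch of the validation logic it implies, written as the method body would sit inside XGBRegressor. The fit signature, the test_size attribute, and the super().fit dispatch are assumptions for illustration, not the actual XGBRegressor code:

from sklearn.model_selection import train_test_split

def fit(self, X, y, eval_set=None, **kwargs):
    if self.eval_set_dynamic:
        if eval_set is not None:
            raise ValueError(
                "you can not pass a static eval_set when eval_set_dynamic is True")
        if self.early_stopping_rounds is not None:
            # Carve the eval_set out of this call's training data.
            X, X_val, y, y_val = train_test_split(X, y, test_size=self.test_size)
            eval_set = [(X_val, y_val)]
    elif self.early_stopping_rounds is not None and eval_set is None:
        raise ValueError("eval_set is not defined")
    return super().fit(X, y, eval_set=eval_set, **kwargs)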

trivialfis commented 2 years ago

Option 2 seems to be reasonable:

def fit(...):
    if self.early_stopping_rounds is not None and eval_set is None:
        train_X, valid_X, train_y, valid_y = train_test_split(...)
    else:
        ...

I think we would like to keep eval_set in fit, as it's a data-dependent parameter and should be specified in the fit method according to the sklearn estimator guidelines.

The next issue is parameters other than sample_weight: we also have base_margin for all estimators. Also, learning-to-rank and survival training are coming to the sklearn interface, each of which has its own way of specifying the data. Specializing over each of them will complicate the code significantly.

Lastly, as mentioned in the previous comment, distributed training and GPU input also need to be considered.

Zelpuz commented 1 week ago

I realize this is an old issue and probably low-priority, but I wanted to add another workaround for people who run into this with early_stopping_rounds in XGBoost's sklearn interface, since searching for the problem might lead them here.

By wrapping the XGBRegressor in a standard sklearn model class, you can build the eval_set dynamically from the data passed to each fit call, which makes it compatible with sklearn's cross-validation methods.

from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import train_test_split
from sklearn.utils.validation import check_array, check_is_fitted
from xgboost import XGBRegressor

class wrapper(BaseEstimator, RegressorMixin):
    def __init__(self, model, test_size=0.2, shuffle=True):
        self.model = model
        self.test_size = test_size
        self.shuffle = shuffle

    def fit(self, X, y):
        early_stopping_rounds = self.model.get_params()['early_stopping_rounds']
        if early_stopping_rounds is not None and early_stopping_rounds > 0:
            # Carve a validation split out of whatever data this fit call gets,
            # so each CV fold produces its own eval_set.
            X_train, X_test, y_train, y_test = train_test_split(
                X, 
                y, 
                test_size=self.test_size, 
                shuffle=self.shuffle, 
            )
            self.model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=0)
        else:
            self.model.fit(X, y, verbose=0)
        self.is_fitted_ = True
        return self

    def predict(self, X):
        check_is_fitted(self)
        X = check_array(X)
        return self.model.predict(X)

    def set_params(self, **params):
        # Delegate to the wrapped model; return self per the sklearn convention.
        self.model.set_params(**params)
        return self

    def get_booster(self):
        return self.model.get_booster()

    def get_dump(self, *args, **kwargs):
        return self.model.get_booster().get_dump(*args, **kwargs)

    def save_config(self):
        self.model.save_config()

xgb = wrapper(XGBRegressor())

This approach from the sklearn side of things might help avoid some of the messiness with other hyperparams, and I don't think it would be too hard to adapt it for classifiers, provided you know how you want to separate out the eval_set.
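
For what it's worth, a hedged sketch of the classifier adaptation: classifier_wrapper is a hypothetical name, and stratify=y in train_test_split is just one assumption about how you might want to separate out the eval_set, not something this thread prescribes.

from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

class classifier_wrapper(BaseEstimator, ClassifierMixin):
    """Hypothetical XGBClassifier variant of the wrapper above."""

    def __init__(self, model, test_size=0.2):
        self.model = model
        self.test_size = test_size

    def fit(self, X, y):
        early_stopping_rounds = self.model.get_params()['early_stopping_rounds']
        if early_stopping_rounds is not None and early_stopping_rounds > 0:
            # Stratify so the held-out eval_set keeps the class balance.
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=self.test_size, stratify=y)
            self.model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=0)
        else:
            self.model.fit(X, y, verbose=0)
        self.is_fitted_ = True
        return self

    def predict(self, X):
        return self.model.predict(X)

clf = classifier_wrapper(XGBClassifier(early_stopping_rounds=10))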