Open iuiu34 opened 2 years ago
Thank you for the offer; I understand that sklearn does it this way. But we have to consider optional information other than sample_weight, as well as GPU data structures and distributed training. Embedding the data partition into our code might not be a flexible design.
Having said that, I would welcome any discussion around the feature and design.
Assuming we're talking about XGBRegressor (XGBClassifier is equivalent), I see 3 options.
Option 1 - new class
Define a new class XGBRegressorWithEarlyStop, as above.
This adds the feature without altering the native class.
But then you have two very similar classes, more documentation to maintain, etc.
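A minimal sketch of what Option 1 could look like, assuming XGBoost >= 1.6 (where early_stopping_rounds is a constructor argument); the 80/20 split and the fit override are illustrative only, not a definitive implementation:

from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

class XGBRegressorWithEarlyStop(XGBRegressor):
    def fit(self, X, y, **kwargs):
        # If early stopping is requested but no eval_set was supplied,
        # carve a validation split out of the training data.
        if self.early_stopping_rounds is not None and kwargs.get("eval_set") is None:
            X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2)
            kwargs["eval_set"] = [(X_val, y_val)]
            return super().fit(X_tr, y_tr, **kwargs)
        return super().fit(X, y, **kwargs)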
Option 2 - implicit params
We don't add any new param to XGBRegressor.
But now, if we call XGBRegressor(..., early_stopping_rounds=5, eval_set=None), then instead of raising the error eval_set is not defined, the class creates the eval_set dynamically at fit time.
Option 3 - explicit params
We add a new param to XGBRegressor:
eval_set_dynamic: bool = True - eval_set is created dynamically.
If we call XGBRegressor(..., early_stopping_rounds=5, eval_set=None, eval_set_dynamic=False), then it still raises the error eval_set is not defined.
If we call XGBRegressor(..., early_stopping_rounds=5, eval_set=[(x_val, y_val)], eval_set_dynamic=True), then it raises the error you can not pass a static eval_set when eval_set_dynamic is True.
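Roughly, the checks Option 3 implies could look like the sketch below (a standalone function just for illustration; the name and error messages are assumptions, not existing XGBoost code):

def validate_eval_set(eval_set, eval_set_dynamic, early_stopping_rounds):
    # Explicit flag: the user either supplies eval_set or asks for a dynamic one.
    if eval_set_dynamic and eval_set is not None:
        raise ValueError("you can not pass a static eval_set when eval_set_dynamic is True")
    if not eval_set_dynamic and early_stopping_rounds is not None and eval_set is None:
        raise ValueError("eval_set is not defined")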
Option 2 (implicit params) seems to be reasonable:
def fit(...):
    if self.early_stopping_rounds is not None and eval_set is None:
        train_X, valid_X, train_y, valid_y = train_test_split(...)
    else:
        ...
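If something like this were implemented, the payoff would be that plain cross-validation works per fold without a fixed eval_set. The snippet below is hypothetical behaviour under the proposal, not the current API (today this call errors because no eval_set is supplied):

from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

# Hypothetical under Option 2: each fold's fit() would split off its own
# validation set internally for early stopping.
reg = XGBRegressor(n_estimators=200, early_stopping_rounds=5)
scores = cross_val_score(reg, X, y, cv=5)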
I think we would like to keep eval_set in fit, as it's a data-dependent parameter and should be specified under the fit method according to the sklearn estimator guidelines.
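(For reference, a minimal sketch of that fit-time usage, assuming XGBoost >= 1.6 where early_stopping_rounds is a constructor argument; the data here is synthetic and only illustrative.)

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

# eval_set is data-dependent, so it is passed to fit() rather than __init__().
reg = XGBRegressor(n_estimators=200, early_stopping_rounds=5)
reg.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])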
The next issue is parameters other than sample_weight: we also have base_margin for all estimators. Also, learning-to-rank and survival training are coming to the sklearn interface, each of which has its own way of specifying the data. Specializing over each of them would complicate the code significantly.
Lastly, as mentioned in the previous comment, distributed training and GPU input also need to be considered.
I realize this is an old issue and probably low-priority, but I wanted to add another workaround for people who encounter this issue with early_stopping_rounds in XGBoost's sklearn interface, since searching for it might lead them here.
By wrapping the XGBRegressor in a standard sklearn estimator class, you can create the eval_set dynamically from the provided training data, which makes it compatible with sklearn's cross-validation methods.
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import train_test_split
from sklearn.utils.validation import check_array, check_is_fitted
from xgboost import XGBRegressor

class wrapper(BaseEstimator, RegressorMixin):
    def __init__(self, model, test_size=0.2, shuffle=True):
        self.model = model
        self.test_size = test_size
        self.shuffle = shuffle

    def fit(self, X, y):
        # If the wrapped model asks for early stopping, carve a validation
        # split out of the training data and pass it as eval_set.
        early_stopping_rounds = self.model.get_params().get("early_stopping_rounds")
        if early_stopping_rounds is not None and early_stopping_rounds > 0:
            X_train, X_test, y_train, y_test = train_test_split(
                X,
                y,
                test_size=self.test_size,
                shuffle=self.shuffle,
            )
            self.model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=0)
        else:
            self.model.fit(X, y, verbose=0)
        self.is_fitted_ = True
        return self

    def predict(self, X):
        check_is_fitted(self)
        X = check_array(X)
        return self.model.predict(X)

    def set_params(self, **params):
        # Delegate parameter updates to the wrapped model; return self per the
        # sklearn convention.
        self.model.set_params(**params)
        return self

    def get_booster(self):
        return self.model.get_booster()

    def get_dump(self, *args, **kwargs):
        return self.model.get_booster().get_dump(*args, **kwargs)

    def save_config(self):
        return self.model.get_booster().save_config()

xgb = wrapper(XGBRegressor())
This approach from the sklearn side of things might help avoid some of the messiness with other hyperparams, and I don't think it would be too hard to adapt it for classifiers, provided you know how you want to separate out the eval_set.
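A quick sketch of the compatibility claim, assuming XGBoost >= 1.6 and the wrapper class defined above (the data is synthetic and illustrative): the wrapper can be handed straight to cross_val_score, and each fold gets its own dynamically created eval_set.

from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

# Each fold's fit() splits off its own validation set for early stopping.
model = wrapper(XGBRegressor(n_estimators=200, early_stopping_rounds=10))
scores = cross_val_score(model, X, y, cv=5)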
XGBRegressor has the parameter eval_set, where you pass an evaluation set that the regressor uses to perform early stopping. Since this eval_set is fixed, when you do cross-validation with n folds, the eval_set is the same in all n folds.
It would be cool to have the option for eval_set to be dynamic at fit time, being a split from that particular fold. Find an example below. If you like the idea, I'm happy to do a proper PR.
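The example the issue refers to is not preserved in this thread; for context, a minimal sketch of the situation being described might look like this (assumes XGBoost >= 1.6; the data, split, and loop are illustrative only):

from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

for train_idx, _ in KFold(n_splits=5).split(x_train):
    reg = XGBRegressor(n_estimators=200, early_stopping_rounds=5)
    # Every fold early-stops against the same static (x_val, y_val);
    # the request is a per-fold split created dynamically at fit time.
    reg.fit(x_train[train_idx], y_train[train_idx], eval_set=[(x_val, y_val)])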