Hi, yes! All the estimators in shap-hypetune are sklearn estimators and can work with sklearn pipelines. Here is a dummy example:
```python
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from lightgbm import LGBMClassifier
from shaphypetune import BoostBoruta

X, y = make_classification(n_samples=6000, n_features=20, n_classes=2,
                           n_informative=4, n_redundant=6, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3)

clf_lgbm = LGBMClassifier(n_estimators=150, random_state=0, n_jobs=-1)
boost_boruta = BoostBoruta(clf_lgbm, max_iter=200, perc=100)

model = make_pipeline(
    StandardScaler(),
    boost_boruta
)

model.fit(X_train, y_train,
          boostboruta__eval_set=[(X_valid, y_valid)],
          boostboruta__early_stopping_rounds=6,
          boostboruta__verbose=0)
```
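If you need the selection results afterwards, you can reach the fitted step through the pipeline. A minimal sketch, assuming the usual Boruta-style attributes (`support_`, `n_features_`) on the fitted selector:

```python
# Hypothetical follow-up: inspect the fitted selector inside the pipeline.
# `support_` and `n_features_` are assumed Boruta-style attributes here.
selector = model.named_steps['boostboruta']
print(selector.n_features_)   # number of features kept
print(selector.support_)      # boolean mask over the input columns

# The whole pipeline now scales and filters in one call
X_valid_reduced = model.transform(X_valid)
```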
Here is the running notebook.
Setting `error_score='raise'` in `cross_validate` should help.
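For instance (a sketch; `model`, `X`, `y` stand for whatever you pass to `cross_validate`):

```python
from sklearn.model_selection import cross_validate

# With error_score='raise', any failure inside a fold raises immediately
# instead of being silently recorded as a NaN score.
scores = cross_validate(model, X, y, cv=5, error_score='raise')
```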
PS: I see `importance_type="shap"` while it should be `importance_type="shap_importances"`.
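That is, the constructor call should look roughly like this (reusing the names from the example above):

```python
# Use the full value name, 'shap_importances', not just "shap"
boost_boruta = BoostBoruta(clf_lgbm, max_iter=200, perc=100,
                           importance_type='shap_importances')
```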
Hello,
When I attempt to add a step that changes the number of columns (specifically one-hot encoding), I get an index error:

`IndexError: index 21 is out of bounds for axis 1 with size 21`

In the example above, is the `classifier__eval_set`'s X_valid getting transformed by the `StandardScaler`? Below is the test code I'm attempting to run:
```python
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier
from shaphypetune import BoostBoruta
import numpy as np

X, y = make_classification(n_samples=6000, n_features=20, n_classes=2,
                           n_informative=4, n_redundant=6, random_state=0)
X = np.hstack((X, np.random.choice(['A', 'B', 'C', 'D'], size=(X.shape[0], 1))))  #! Add a column of random 'A'-'D' values to X
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3)

# Create a column transformer to OHE the new categorical column
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), [20]),  #! Added this
        ('scal', StandardScaler(), list(range(20)))
    ],
    remainder='drop'
)

clf_lgbm = XGBClassifier(n_estimators=150, random_state=0, n_jobs=-1)
boost_boruta = BoostBoruta(clf_lgbm, max_iter=200, perc=100)

model = make_pipeline(
    preprocessor,
    boost_boruta
)

model.fit(X_train, y_train,
          boostboruta__eval_set=[(X_valid, y_valid)],
          boostboruta__early_stopping_rounds=6,
          boostboruta__verbose=0)
```
And the resulting error log:
```
Cell In [19], line 40
     33 boost_boruta = BoostBoruta(clf_lgbm, max_iter=200, perc=100)
     35 model = make_pipeline(
     36     preprocessor,
     37     boost_boruta
     38 )
---> 40 model.fit(X_train, y_train,
     41           boostboruta__eval_set=[(X_valid, y_valid)],
     42           boostboruta__early_stopping_rounds=6,
     43           boostboruta__verbose=0)

File c:\Users\sgobat\Miniconda3\envs\auth_env\lib\site-packages\sklearn\pipeline.py:382, in Pipeline.fit(self, X, y, **fit_params)
    380 if self._final_estimator != "passthrough":
    381     fit_params_last_step = fit_params_steps[self.steps[-1][0]]
--> 382     self._final_estimator.fit(Xt, y, **fit_params_last_step)
    384 return self

File c:\Users\sgobat\Miniconda3\envs\auth_env\lib\site-packages\shaphypetune\_classes.py:164, in _BoostSearch.fit(self, X, y, trials, **fit_params)
    161 self.boost_type_ = _check_boosting(self.estimator)
    163 if self.param_grid is None:
--> 164     results = self._fit(X, y, fit_params)
    166 for v in vars(results['model']):
    167     if v.endswith("_") and not v.startswith("__"):

File c:\Users\sgobat\Miniconda3\envs\auth_env\lib\site-packages\shaphypetune\_classes.py:67, in _BoostSearch._fit(self, X, y, fit_params, params)
     65 model = self._build_model(params)
     66 if isinstance(model, _BoostSelector):
---> 67     model.fit(X=X, y=y, **fit_params)
     68 else:
     69     with contextlib.redirect_stdout(io.StringIO()):

File c:\Users\sgobat\Miniconda3\envs\auth_env\lib\site-packages\shaphypetune\_classes.py:545, in _Boruta.fit(self, X, y, **fit_params)
    543 feat_id_real = np.where(self.support_)[0]
    544 n_real = feat_id_real.shape[0]
--> 545 _fit_params, estimator = self._check_fit_params(fit_params, feat_id_real)
    546 estimator.set_params(random_state=i + 1000)
    547 _X = self._create_X(X, feat_id_real)

File c:\Users\sgobat\Miniconda3\envs\auth_env\lib\site-packages\shaphypetune\_classes.py:436, in _Boruta._check_fit_params(self, fit_params, feat_id_real)
    434 else:
    435     if 'eval_set' in _fit_params:  # iterative model fit
--> 436         _fit_params['eval_set'] = list(map(lambda x: (
    437             self._create_X(x[0], feat_id_real), x[1]
...
--> 400 X_real = X[:, feat_id_real].copy()
    401 X_sha = X_real.copy()
    402 X_sha = np.apply_along_axis(self._random_state.permutation, 0, X_sha)

IndexError: index 21 is out of bounds for axis 1 with size 21
```
Hi @ds-sebastian,
the param `boostboruta__eval_set` in `fit` simply passes X_valid directly to `BoostBoruta`, skipping the preprocessing done by the `ColumnTransformer`. This behavior is intended by the sklearn API (see the fit_params section).
In other words, `BoostBoruta` fits on X_train (4200, 24), which is scaled and encoded, while validating on X_valid (1800, 21), which is neither scaled nor encoded.
This happens even without shap-hypetune, simply using `XGBClassifier` in the pipeline.
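One possible workaround (a sketch, not something the package does for you) is to fit the preprocessor separately and pass the already-transformed validation set through `eval_set`, so both sides see the same 24 columns:

```python
# Sketch of a workaround: transform the validation data yourself so that
# eval_set has the same columns the selector is trained on.
preprocessor.fit(X_train)
X_valid_t = preprocessor.transform(X_valid)

model.fit(X_train, y_train,
          boostboruta__eval_set=[(X_valid_t, y_valid)],
          boostboruta__early_stopping_rounds=6,
          boostboruta__verbose=0)
```

The pipeline refits the preprocessor on the same X_train during `fit`, so the manual transform stays consistent with what the selector sees.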
Ah ok, thanks, that's what I thought was happening. I think with `GridSearchCV` you can pass a pipeline. As a future improvement, could it be possible to do the same with this (awesome) package? Something like:

```python
search = BoostBoruta(pipeline)
search.fit(X, y)
```

When I attempt this, it seems it can't identify the `boost_type`, which makes sense since it's a pipeline.
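For reference, the `GridSearchCV` pattern being compared to looks roughly like this (a sketch reusing `model` from the example above; it assumes `BoostBoruta` exposes `max_iter` via `get_params` and delegates scoring to the inner estimator):

```python
from sklearn.model_selection import GridSearchCV

# The whole pipeline is the estimator, so every fold re-runs the
# preprocessing; step-prefixed names reach nested parameters.
search = GridSearchCV(model,
                      param_grid={'boostboruta__max_iter': [100, 200]},
                      cv=3)
search.fit(X_train, y_train)
```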
Hello,
First of all, thank you for this great repo. It looks very promising. I'd like to use BoostBoruta within a scikit-learn pipeline. Is it possible?
For now, here is the code I've tried, with no success:
No exception is thrown, but no model is learned either... Any ideas why?
Thanks in advance