cerlymarco / shap-hypetune

A Python package for simultaneous Hyperparameters Tuning and Features Selection for Gradient Boosting Models.
MIT License

Can BoostBoruta be used in a scikit-pipeline? #17

Closed: YoannPitarch closed this issue 2 years ago

YoannPitarch commented 2 years ago

Hello, first of all, thank you for this great repo. It looks very promising. I'd like to use BoostBoruta within a scikit-learn pipeline. Is it possible?

For now, here is the code I've tried with no success:

# imports used by this snippet
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from shaphypetune import BoostBoruta

# get the numeric and categorical column names
num_cols = X_train.select_dtypes(exclude=['object']).columns.tolist()
cat_cols = X_train.select_dtypes(include=['object']).columns.tolist()

# pipeline for numerical columns
num_pipe = make_pipeline(
    StandardScaler()
)
# pipeline for categorical columns
cat_pipe = make_pipeline(
    OneHotEncoder(handle_unknown='ignore', sparse=False)
)

# combine both pipelines
full_pipe = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat', cat_pipe, cat_cols)
])

# clf_lgbm and param_dist_hyperopt are defined earlier
model = BoostBoruta(
    clf_lgbm, param_grid=param_dist_hyperopt, n_iter=8, sampling_seed=0,
    importance_type="shap", train_importance=True, n_jobs=-1, verbose=2
)

pipeline_hypetune = make_pipeline(full_pipe, model)
model_selection = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=2022)

results = cross_validate(pipeline_hypetune, X_train, y, scoring='accuracy',
                         cv=model_selection, return_estimator=True)

No exception is thrown but no model is learned either... Any ideas why?

Thanks in advance

cerlymarco commented 2 years ago

Hi, yes! All the estimators in shap-hypetune are sklearn estimators and can work with sklearn pipelines. Here is a dummy example:

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from lightgbm import LGBMClassifier
from shaphypetune import BoostBoruta

# synthetic binary classification data
X, y = make_classification(n_samples=6000, n_features=20, n_classes=2,
                           n_informative=4, n_redundant=6, random_state=0)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3)

clf_lgbm = LGBMClassifier(n_estimators=150, random_state=0, n_jobs=-1)

boost_boruta = BoostBoruta(clf_lgbm, max_iter=200, perc=100)

# BoostBoruta is a regular sklearn estimator, so it can be the final pipeline step
model = make_pipeline(
    StandardScaler(),
    boost_boruta
)

# fit params are routed to the final step via the 'boostboruta__' prefix
model.fit(X_train, y_train,
          boostboruta__eval_set=[(X_valid, y_valid)],
          boostboruta__early_stopping_rounds=6,
          boostboruta__verbose=0)
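
A quick usage sketch (not in the original reply, assuming the standard sklearn estimator API the reply describes): once fitted, the pipeline predicts end-to-end, and the Boruta step exposes the selected-feature mask via support_ (the same attribute visible in the traceback further below).

# X_valid is routed through the StandardScaler automatically
preds = model.predict(X_valid)
# boolean mask of the features confirmed by Boruta
selected_mask = boost_boruta.support_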

Here is the running notebook.

Setting error_score='raise' in cross_validate should help: by default, failed fits are scored as NaN instead of raising, which would explain seeing no exception and no fitted model.

PS: I see importance_type="shap", while it should be importance_type="shap_importances".
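
Applied to the snippet from the question, the two suggestions would look like this (a sketch reusing the variable names defined above):

model = BoostBoruta(
    clf_lgbm, param_grid=param_dist_hyperopt, n_iter=8, sampling_seed=0,
    importance_type="shap_importances",  # was "shap"
    train_importance=True, n_jobs=-1, verbose=2
)

results = cross_validate(pipeline_hypetune, X_train, y, scoring='accuracy',
                         cv=model_selection, return_estimator=True,
                         error_score='raise')  # raise instead of scoring failed fits as NaN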

ds-sebastian commented 2 years ago

Hello,

When I add a step that changes the number of columns (specifically one-hot encoding), I get an index error: IndexError: index 21 is out of bounds for axis 1 with size 21

In the example above, is the boostboruta__eval_set's X_valid getting transformed by the StandardScaler? Below is the test code I'm attempting to run:

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier
from shaphypetune import BoostBoruta
import numpy as np

X, y = make_classification(n_samples=6000, n_features=20, n_classes=2,
                           n_informative=4, n_redundant=6, random_state=0)

#! Append a column of random 'A'/'B'/'C'/'D' values to X
X = np.hstack((X, np.random.choice(['A', 'B', 'C', 'D'], size=(X.shape[0], 1))))

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3)

# Create a column transformer to OHE the new categorical column
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), [20]),  #! Added this
        ('scal', StandardScaler(), list(range(20)))
    ],
    remainder='drop'
)

# variable name kept from the earlier example, though this is now XGBoost
clf_lgbm = XGBClassifier(n_estimators=150, random_state=0, n_jobs=-1)

boost_boruta = BoostBoruta(clf_lgbm, max_iter=200, perc=100)

model = make_pipeline(
    preprocessor,
    boost_boruta
)

model.fit(X_train, y_train,
          boostboruta__eval_set=[(X_valid, y_valid)],
          boostboruta__early_stopping_rounds=6,
          boostboruta__verbose=0)

And the resulting error log:


Cell In [19], line 40
     33 boost_boruta = BoostBoruta(clf_lgbm, max_iter=200, perc=100)
     35 model = make_pipeline(
     36     preprocessor,
     37     boost_boruta
     38 )
---> 40 model.fit(X_train, y_train, 
     41           boostboruta__eval_set=[(X_valid, y_valid)], 
     42           boostboruta__early_stopping_rounds=6, 
     43           boostboruta__verbose=0)

File c:\Users\sgobat\Miniconda3\envs\auth_env\lib\site-packages\sklearn\pipeline.py:382, in Pipeline.fit(self, X, y, **fit_params)
    380     if self._final_estimator != "passthrough":
    381         fit_params_last_step = fit_params_steps[self.steps[-1][0]]
--> 382         self._final_estimator.fit(Xt, y, **fit_params_last_step)
    384 return self

File c:\Users\sgobat\Miniconda3\envs\auth_env\lib\site-packages\shaphypetune\_classes.py:164, in _BoostSearch.fit(self, X, y, trials, **fit_params)
    161 self.boost_type_ = _check_boosting(self.estimator)
    163 if self.param_grid is None:
--> 164     results = self._fit(X, y, fit_params)
    166     for v in vars(results['model']):
    167         if v.endswith("_") and not v.startswith("__"):

File c:\Users\sgobat\Miniconda3\envs\auth_env\lib\site-packages\shaphypetune\_classes.py:67, in _BoostSearch._fit(self, X, y, fit_params, params)
     65 model = self._build_model(params)
     66 if isinstance(model, _BoostSelector):
---> 67     model.fit(X=X, y=y, **fit_params)
     68 else:
     69     with contextlib.redirect_stdout(io.StringIO()):

File c:\Users\sgobat\Miniconda3\envs\auth_env\lib\site-packages\shaphypetune\_classes.py:545, in _Boruta.fit(self, X, y, **fit_params)
    543 feat_id_real = np.where(self.support_)[0]
    544 n_real = feat_id_real.shape[0]
--> 545 _fit_params, estimator = self._check_fit_params(fit_params, feat_id_real)
    546 estimator.set_params(random_state=i + 1000)
    547 _X = self._create_X(X, feat_id_real)

File c:\Users\sgobat\Miniconda3\envs\auth_env\lib\site-packages\shaphypetune\_classes.py:436, in _Boruta._check_fit_params(self, fit_params, feat_id_real)
    434 else:
    435     if 'eval_set' in _fit_params:  # iterative model fit
--> 436         _fit_params['eval_set'] = list(map(lambda x: (
    437             self._create_X(x[0], feat_id_real), x[1]
...
--> 400     X_real = X[:, feat_id_real].copy()
    401     X_sha = X_real.copy()
    402     X_sha = np.apply_along_axis(self._random_state.permutation, 0, X_sha)

IndexError: index 21 is out of bounds for axis 1 with size 21

cerlymarco commented 2 years ago

Hi @ds-sebastian,

the boostboruta__eval_set fit param simply passes X_valid directly to BoostBoruta, skipping the preprocessing done by the ColumnTransformer. This behavior is intended by the sklearn API (see the fit_params section).

In other words, BoostBoruta fits on the transformed X_train of shape (4200, 24), which is scaled and encoded, while validating on the raw X_valid of shape (1800, 21), which is neither scaled nor encoded.

This happens even without shap-hypetune, simply using XGBClassifier in the pipeline.
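
One possible workaround (a sketch, not from the thread): transform the validation set yourself with a copy of the preprocessor fitted on the training data, so that what reaches BoostBoruta through eval_set has the same shape as the transformed training data.

from sklearn.base import clone

# fit a copy of the preprocessor on X_train only, then encode X_valid;
# note the plain OneHotEncoder may still fail on categories unseen in X_train
prep = clone(preprocessor).fit(X_train)
X_valid_enc = prep.transform(X_valid)

model.fit(X_train, y_train,
          boostboruta__eval_set=[(X_valid_enc, y_valid)],
          boostboruta__early_stopping_rounds=6,
          boostboruta__verbose=0)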

ds-sebastian commented 2 years ago

Ah ok, thanks, that's what I thought was happening. I think with GridSearchCV you can pass a pipeline. As a future improvement, would it be possible to do the same with this (awesome) package? Something like:

search = BoostBoruta(pipeline)
search.fit(X,y)

When I attempt this, it seems it can't identify the boost_type, which makes sense since it's a pipeline.
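
For reference, the GridSearchCV pattern being compared against looks roughly like this (a sketch reusing the variables above; the grid values are illustrative):

from sklearn.model_selection import GridSearchCV

# GridSearchCV accepts an entire pipeline as its estimator and reaches the
# steps' parameters through the 'stepname__param' convention
search = GridSearchCV(
    make_pipeline(preprocessor, XGBClassifier()),
    param_grid={'xgbclassifier__n_estimators': [100, 150]},
    cv=3
)
search.fit(X_train, y_train)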