Ekeany / Boruta-Shap

A tree-based feature selection tool that combines the Boruta feature selection algorithm with Shapley values.
MIT License

[BUG] create_shadow_features () - LightGBM + BorutaShap - AttributeError: 'Series' object has no attribute 'select_dtypes' #65

Open brendanwee opened 3 years ago

brendanwee commented 3 years ago

I am trying to run BorutaShap with several models, one of them being LightGBM. When I run BorutaShap on a small number of rows, I get this AttributeError in create_shadow_features():

... in <module>
     18 
     19 boruta_.fit(X=toy_X, y=toy_y.Label, n_trials=n_trials, normalize=normalize, sample=sample,
---> 20                 random_state=random_state, verbose=False)
     21 
     22 # results, fs = feature_selection.boruta(toy_df, french_f_y.Label, model=clf, n_trials=10, percentile=percentile,

.../BorutaShap.py in fit(self, X, y, n_trials, random_state, sample, train_or_test, normalize, verbose)
    344             self.remove_features_if_rejected()
    345             self.columns = self.X.columns.to_numpy()
--> 346             self.create_shadow_features()
    347 
    348             # early stopping

.../BorutaShap.py in create_shadow_features(self)
    541         self.X_shadow = self.X.apply(np.random.permutation)
    542         # append
--> 543         obj_col = self.X_shadow.select_dtypes("object").columns.tolist()
    544         if obj_col ==[] :
    545              pass

.../pandas/core/generic.py in __getattr__(self, name)
   5476         ):
   5477             return self[name]
-> 5478         return object.__getattribute__(self, name)
   5479 
   5480     def __setattr__(self, name: str, value) -> None:

AttributeError: 'Series' object has no attribute 'select_dtypes'

The following code will reproduce the error:

import pandas as pd
from lightgbm import LGBMClassifier
from BorutaShap import BorutaShap

toy_X = pd.read_csv('toy_df.csv', index_col=0)
toy_y = pd.read_csv('toy_labels.csv', index_col=0)

model = LGBMClassifier()
importance_measure = 'shap'
percentile = 70
random_state = 0
normalize = True
sample = False
n_trials = 100

boruta_ = BorutaShap(model=model, importance_measure=importance_measure, classification=True,
                     percentile=percentile)

boruta_.fit(X=toy_X, y=toy_y.Label, n_trials=n_trials, normalize=normalize, sample=sample,
            random_state=random_state, verbose=False)

Attached files: toy_df.csv, toy_labels.csv

The same data will work with other classifiers, and BorutaShap works with LightGBM on other data sets. I suspect it has something to do with the dataset size, as I've seen this happen only with small subsets of the data.
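
One possible mechanism, offered as a guess rather than a confirmed diagnosis: if every feature has already been dropped by the time create_shadow_features() runs, self.X has rows but no columns. DataFrame.apply with a function that returns an array reduces such a frame to an empty Series instead of a DataFrame. The sketch below uses made-up frames (not toy_df.csv) to show that behaviour in isolation. Whether this is what actually happens inside BorutaShap on small samples is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for self.X after all columns have been removed:
# five rows, zero columns. This is NOT the data attached to the issue.
empty_X = pd.DataFrame(index=range(5))

# Same call as line 541 in the traceback.
empty_shadow = empty_X.apply(np.random.permutation)
print(type(empty_shadow).__name__)

# With at least one column, the same call keeps the DataFrame type.
full_X = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [0.1, 0.2, 0.3, 0.4, 0.5]})
full_shadow = full_X.apply(np.random.permutation)
print(type(full_shadow).__name__)

# select_dtypes exists only on DataFrame, which is what line 543 relies on.
print(hasattr(empty_shadow, "select_dtypes"), hasattr(full_shadow, "select_dtypes"))
```

If this is the cause, it would also fit the observation that only small subsets fail, since a small sample makes it more likely that no feature survives, though that link is speculative.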

Ekeany commented 3 years ago

Thanks for spotting this, I will have a look to see what's happening.
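
Until the cause is confirmed, one defensive option is to build the shadow features in a way that always yields a DataFrame. The sketch below is hypothetical: `make_shadow_features` is not part of BorutaShap, and the `shadow_` column prefix and `seed` parameter are assumptions chosen for illustration. It permutes each column independently and returns a zero-column frame when no features remain, so a later `select_dtypes` call cannot land on a Series.

```python
import numpy as np
import pandas as pd


def make_shadow_features(X: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Return a column-wise shuffled copy of X, always as a DataFrame.

    Hypothetical helper for illustration, not BorutaShap's implementation.
    """
    rng = np.random.default_rng(seed)
    # Build the permuted columns explicitly instead of relying on
    # DataFrame.apply, whose return type depends on the input shape.
    shuffled = {
        f"shadow_{col}": rng.permutation(X[col].to_numpy()) for col in X.columns
    }
    return pd.DataFrame(shuffled, index=X.index)


X = pd.DataFrame({"a": [1, 2, 3, 4], "b": [0.5, 1.5, 2.5, 3.5]})
shadow = make_shadow_features(X)
no_features = make_shadow_features(X.drop(columns=["a", "b"]))

# Both results support select_dtypes, including the zero-column case.
print(shadow.select_dtypes("object").columns.tolist())
print(no_features.select_dtypes("object").columns.tolist())
```

A caller would still need to decide what to do when no features remain, for example stopping the trial loop early. That choice is left open here.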