Ekeany / Boruta-Shap

A Tree based feature selection tool which combines both the Boruta feature selection algorithm with shapley values.
MIT License
559 stars 86 forks source link

[BUG] Feature importance not evaluated on test set when setting train_or_test = 'test' #126

Open wilmar-thomas opened 7 months ago

wilmar-thomas commented 7 months ago

Describe the bug

I don't know if this is the intended behaviour or not, but when setting train_or_test parameter to 'test', data is first split into train/test and then fit on training set, cf lines 302-310:

        if self.train_or_test.lower() == 'test':
            # keeping the same naming convenetion as to not add complexit later on
            self.X_boruta_train, self.X_boruta_test, self.y_train, self.y_test, self.w_train, self.w_test = train_test_split(self.X_boruta,
                                                                                                                                self.y,
                                                                                                                                self.sample_weight,
                                                                                                                                test_size=0.3,
                                                                                                                                random_state=self.random_state,
                                                                                                                                stratify=self.stratify)
            self.Train_model(self.X_boruta_train, self.y_train, sample_weight = self.w_train)

However, X_boruta_test is not used anywhere else, in fact the whole dataset X is used to derive feature importance, regardless of chosen train_or_test, cf. lines 856 and 873 for importance_measure == 'shap' :

self.shap_values = np.array(explainer.shap_values(self.X_boruta))
self.shap_values = explainer.shap_values(self.X_boruta)

and line 815 for importance_measure == 'perm' :

perm_importances_ = permutation_importance(self.model, self.X, self.y, scoring='f1')

While for SHAP this may not constitute a big difference, according to this post, this does not correspond to what is recommended here for permutation feature importance.

Granted X and X_train are not exactly the same but still share 70% of the samples so I'm wondering if this is the intended behaviour. Could anyone provide some guidance on this?

Thank you for your help.