I don't know if this is the intended behaviour or not, but when the train_or_test parameter is set to 'test', the data is first split into train/test sets and the model is then fit on the training set, cf. lines 302-310:
```python
if self.train_or_test.lower() == 'test':
    # keeping the same naming convention as to not add complexity later on
    self.X_boruta_train, self.X_boruta_test, self.y_train, self.y_test, self.w_train, self.w_test = train_test_split(
        self.X_boruta,
        self.y,
        self.sample_weight,
        test_size=0.3,
        random_state=self.random_state,
        stratify=self.stratify)
    self.Train_model(self.X_boruta_train, self.y_train, sample_weight=self.w_train)
```
However, X_boruta_test is not used anywhere else; in fact, the whole dataset X is used to derive the feature importance regardless of the chosen train_or_test, cf. lines 856 and 873 for importance_measure == 'shap' and line 815 for importance_measure == 'perm'.
While for SHAP this may not make a big difference, according to this post, it does not correspond to what is recommended here for permutation feature importance.
Granted, X and X_train are not exactly the same, but they still share 70% of the samples, so I'm wondering whether this is the intended behaviour. Could anyone provide some guidance on this?
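For reference, a minimal sketch of the pattern I would have expected for train_or_test='test': fit on the training split, then derive permutation importance from the held-out test split only. The model and dataset here are illustrative stand-ins, not BorutaShap's actual internals:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Toy data standing in for X_boruta / y
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit only on the training split
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Importance is computed on the unseen test split, not on the full X
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)
print(result.importances_mean.shape)  # one mean importance per feature
```

In the current code, by contrast, X_test is effectively discarded and the full X is passed to the importance computation.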
Thank you for your help.