cerlymarco / MEDIUM_NoteBook

Repository containing notebooks of my posts on Medium
MIT License

Question: SHAP RFE vs Wrapper based RFE #39

Closed: markdregan closed this issue 2 years ago

markdregan commented 2 years ago

Hi Marco

Thanks for all the great write-ups on Medium and code on GitHub. Great contributions!

I have a question I'm hoping you can shed light on. I'm working with a dataset that has many correlated features, and I'm building an automated method to select the most important ones.

I'm looking at your work on SHAP with RFE, and I'm trying to understand whether and why it would be better than a wrapper-based RFE, i.e. a method where the model is trained repeatedly with one column left out, and the column whose removal causes the least reduction in model performance (e.g. AUC) is then dropped from the dataset. This is repeated to find both the optimal number of features and which features to include. A sketch of what I mean is below.
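To be concrete, the procedure I have in mind looks roughly like this (a minimal sketch; the model and data names are placeholders, not code from your repo):

```python
from sklearn.model_selection import cross_val_score

def backward_elimination_by_auc(model, X, y, min_features=1, cv=5):
    """Repeatedly drop the feature whose removal hurts CV AUC the least."""
    features = list(X.columns)
    history = []
    while len(features) > min_features:
        scores = {}
        for f in features:
            kept = [c for c in features if c != f]
            scores[f] = cross_val_score(
                model, X[kept], y, scoring="roc_auc", cv=cv
            ).mean()
        # the feature whose removal leaves the highest AUC is the safest drop
        best_drop = max(scores, key=scores.get)
        features.remove(best_drop)
        history.append((best_drop, scores[best_drop]))
    return features, history
```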

Do you have a position on which of these approaches is better? And if it's SHAP RFE, I'd love to know why.

What holds me back from the SHAP RFE approach is that SHAP values explain the effect a feature has on the model's output prediction, but they don't necessarily say whether that feature has a positive or negative impact on model performance. That said, my knowledge of SHAP is limited.

Anyway, I'd appreciate hearing your thoughts.

Best, Mark

cerlymarco commented 2 years ago

Hi, thanks for this feedback...

SHAP with RFE is simply a wrapper-based RFE method; they are the same thing. The only thing that changes is how the feature importance is computed... with standard RFE, feature importance comes from the tree-based ML algorithm, while for SHAP RFE the feature importance comes from the SHAP computation. See the sketch below.
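To make that concrete, here is a rough sketch of the shared loop (the names are illustrative, not the actual shap-hypetune code); the two variants differ only in the `importance_fn` you plug in:

```python
def rfe(model, X, y, importance_fn, min_features=1, step=1):
    """Generic wrapper RFE: refit, rank the features, drop the weakest."""
    features = list(X.columns)
    while len(features) > min_features:
        model.fit(X[features], y)
        importance = importance_fn(model, X[features])
        ranked = sorted(zip(importance, features))  # weakest first
        features = [f for _, f in ranked[step:]]    # drop `step` weakest
    return features

def tree_importance(model, X):
    # standard RFE: importance from the tree-based algorithm itself
    return model.feature_importances_
```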

SHAP is good at dealing with the importance of categorical or high-cardinality features. It can generalize well since it's model-agnostic (unlike tree-based feature importance), so it's less prone to being "overconfident". On the other hand, it's a bit slow to compute.

SHAP values are computed per sample and can be negative or positive (i.e. a positive/negative impact on the target, with the magnitude indicating its strength). SHAP feature importances are simply the column-wise mean of the SHAP values in absolute terms. This is a good approximation of the impact of the features on the model outcomes (no matter the direction, they simply have to be important for our predictive task).
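In code, that importance is just a few lines (a sketch; note that some SHAP versions return one array per class for binary classifiers). Plugged into the loop above in place of `tree_importance`, it turns standard RFE into SHAP RFE:

```python
import numpy as np
import shap

def shap_importance(model, X):
    sv = shap.TreeExplainer(model).shap_values(X)
    if isinstance(sv, list):        # per-class output on some SHAP versions
        sv = sv[1]                  # take the positive class
    return np.abs(sv).mean(axis=0)  # column-wise mean of |SHAP values|
```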

All the best

markdregan commented 2 years ago

Thanks for this, Marco. I have one follow-up.

The RFE method I'm describing doesn't use tree-based feature importance (which, as you note, has clear issues with categorical features and substitutions); it uses mlxtend's SequentialFeatureSelector instead. This measures the AUC at each iteration and removes the features that contribute the least to AUC. Below is a plot showing how model performance decays as features are added, followed by a sketch of the configuration.

[Plot: model AUC vs. number of features selected]
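For reference, roughly how I'm configuring it (a sketch based on mlxtend's documented SFS API; the estimator and toy data are placeholders):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

# placeholder data standing in for my real dataset
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])

sfs = SFS(
    GradientBoostingClassifier(),
    k_features=(1, X.shape[1]),  # search every subset size in this range
    forward=False,               # backward elimination, as described above
    floating=False,
    scoring="roc_auc",           # each candidate subset is judged on AUC
    cv=5,
    n_jobs=-1,
)
sfs = sfs.fit(X, y)
print(sfs.k_feature_names_, round(sfs.k_score_, 4))
```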

So, my follow-up question: SHAP seems to select the features that explain the greatest amount of variance in the model output, but it doesn't necessarily explain changes in model performance. Have you found from your work that this isn't a concern, or is this characteristic actually a plus / strength?

OR: does shap-hypetune actually explain the variance in log(loss)?

Side note: mlxtend's SFS is also slow(ish) and acts in a greedy manner, removing the features that do not contribute to model performance (rather than just the tree's MDI). In the back of my mind, I'm wondering whether this greedy nature of SFS is a bad thing, and/or whether SHAP feature importance somehow has the upper hand on it.

Before I dive into experimenting / testing, I thought I'd try to tap your knowledge, given all the work you've put into this.

Best, Mark