TeamHG-Memex / eli5

A library for debugging/inspecting machine learning classifiers and explaining their predictions
http://eli5.readthedocs.io
MIT License

investigate SHAP feature importances #240

Open kmike opened 6 years ago

kmike commented 6 years ago

There is a recent paper which explains how to compute explain_prediction-style contributions for trees and tree ensembles, which the authors claim to be better than treeinterpreter-like measures: https://arxiv.org/pdf/1706.06060.pdf. It is already implemented for LightGBM (https://github.com/Microsoft/LightGBM/pull/825) and XGBoost (https://github.com/dmlc/xgboost/pull/2438). There is also a repo with model-agnostic explanations: https://github.com/slundberg/shap.
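For background, the SHAP value of feature i at input x is the classic Shapley value of a cooperative game where a coalition's payoff is the model's output with only that coalition's features "known" (the rest replaced by a baseline). A minimal, purely illustrative sketch of the exact computation - exponential in the number of features, nothing like the optimized algorithms in the shap library or the paper above:

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values for model f at point x, treating 'missing'
    features as taking values from a fixed baseline point.
    Exponential-time brute force -- for illustration only."""
    n = len(x)
    phi = [0.0] * n
    features = list(range(n))
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for S in combinations(others, size):
                # Coalition S: features in S take x's values, rest take baseline.
                with_i = [x[j] if (j in S or j == i) else baseline[j] for j in features]
                without_i = [x[j] if j in S else baseline[j] for j in features]
                phi[i] += weight * (f(with_i) - f(without_i))
    return phi

# Toy model with an interaction term between features 0 and 1.
f = lambda z: 2 * z[0] + z[1] + z[0] * z[1]
x, base = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
phi = shapley_values(f, x, base)
# The interaction credit is split evenly: phi == [2.5, 1.5, 0.0],
# and by the efficiency property sum(phi) == f(x) - f(base) == 4.0.
```

The "efficiency" property shown in the last comment - contributions summing exactly to the prediction minus the baseline prediction - is what makes SHAP values convenient to display in an eli5-style feature-weight table.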

asperaa commented 5 years ago

I would like to add SHAP support to ELI5. I will start with the smaller issues to get to know eli5 better. If there is any other relevant information, please let me know.

AshwinB-hat commented 5 years ago

Hi ELI5 community, @kmike @lopuhin , I am equally interested in contributing and adding SHAP support to ELI5 through GSoC. I am currently working on issue #276 and will be raising a PR for explain_weights in two days. I just wanted to ask whether my proposal can include explain_prediction for catboost along with SHAP support?

lopuhin commented 5 years ago

Hi @AshwinB-hat that's great!

I just wanted to inquire whether my proposal can include explain_prediction for catboost along with SHAP support?

Since SHAP does support CatBoost, it would be awesome to have explain_prediction for catboost via SHAP.

AshwinB-hat commented 5 years ago

Thanks for the prompt reply @lopuhin! I think I've just found the direction for my application.

AshwinB-hat commented 5 years ago

@lopuhin @kmike @ivanprado Hey,

Regarding SHAP integration, I had two doubts:

1) Should the format of the feature importances be the same as eli5's, HTML styling and all, or can it be similar to what the Python shap library provides? I am trying to weigh the pros and cons of implementing a custom SHAP as opposed to wrapping the existing shap library (which is well documented and optimized).

2) While investigating SHAP performance, I also came across the reason why LIME is still used (it is significantly faster for unsupervised models) even though it is a subset of SHAP. I was wondering if we could provide an option, perhaps an extra parameter, for users to switch between LIME and SHAP explanations. The prediction time difference is large (~1.5 hours) with only a slight drop in accuracy. Refer to LIME vs SHAP for details.

lopuhin commented 5 years ago

Hey @AshwinB-hat great questions! @kmike is the primary mentor for this project, but from my point of view:

  1. It would be great to provide SHAP explanations using the same unified eli5 interface. E.g. one piece of feedback I got was that it's a pain to extract feature importances from SHAP for further processing, while eli5 provides an export to a dataframe. So it would be nice to have tight integration here. But some way to obtain the original SHAP visualization also looks useful.

  2. That's an interesting observation. I have observed SHAP to be very fast for XGBoost and LightGBM - maybe a different algorithm is used for them? LIME, on the other hand, is usually quite slow in comparison, in my experience, so it may depend on the models and datasets used. Regarding providing an option to use LIME, I see that more as a separate project of integrating LIME, which is probably quite significant, especially given that we also have a slightly different LIME implementation inside eli5. But if time is left over from SHAP integration, it would be interesting to look into this.

AshwinB-hat commented 5 years ago

1) Yes, that makes sense. Pros: the shap visualisations. Cons: extraction of features. I will look more deeply into extracting the features.

2) The SHAP algorithm is slower than LIME since it is a superset. But the SHAP Python library addresses this compute problem with approximations and optimizations that greatly speed things up while seeking to keep the nice Shapley properties. Except for clustering algorithms, the shap implementation is more efficient than LIME.

As far as I can tell, the second question is resolved. I will look into optimizing SHAP for KNN, or a workaround. LIME can be considered a separate project.

Regarding the first question: implementing the SHAP features ourselves can be done, but the issue is that we would have to write custom C++ implementations to optimise them. I personally think the pain of extracting features is bearable compared to reverse engineering the current shap library and implementing it again. Your suggestions will be of great help here. @lopuhin @kmike @ivanprado

lopuhin commented 5 years ago

@AshwinB-hat I'll be sure to read up more on the different algorithms SHAP uses - but my understanding is that the primary goal is to use the SHAP library implementation mentioned in this section: https://github.com/slundberg/shap#tree-ensemble-example-with-treeexplainer-xgboostlightgbmcatboostscikit-learn-models (fast C++ implementations are supported for XGBoost, LightGBM, CatBoost, and scikit-learn tree models). We weren't planning any custom implementation in eli5.

AshwinB-hat commented 5 years ago

@lopuhin Sure. This doubt arose because on the ideas page (gsoc ideas eli5) it was mentioned that we could either wrap or implement.

I will look into extracting the feature importances from the shap implementation. Thanks!

lopuhin commented 5 years ago

@AshwinB-hat indeed, you are right, I missed this. I would defer this to @kmike to clarify :)

AshwinB-hat commented 5 years ago

@lopuhin @kmike I have just created the first draft of my proposal: GSOC proposal 2019. Please let me know if I need to change or improve anything. Thanks!

kmike commented 5 years ago

Hey @AshwinB-hat !

SHAP is an interesting beast, because on the practical side it combines both a LIME-like algorithm and a treeinterpreter-like algorithm. They have somewhat different use cases and different performance characteristics.

Currently in eli5, explain_prediction for all decision tree algorithms uses a treeinterpreter-like algorithm. But SHAP is strictly better, so I think one of the goals should be to switch tree ensembles to SHAP by default, keeping the current API.
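For contrast with SHAP, the treeinterpreter-like decomposition mentioned here attributes, at each split along the decision path, the change in the node's mean prediction to the split feature. A hand-rolled sketch on a hardcoded toy regression tree (this is not eli5's actual code, just an illustration of the idea):

```python
# Toy regression tree: each node stores the mean target value of the
# training samples reaching it; internal nodes also store the split
# feature index and threshold. Leaves have feature=None.
tree = {
    "value": 10.0, "feature": 0, "threshold": 0.5,
    "left": {
        "value": 4.0, "feature": 1, "threshold": 2.0,
        "left":  {"value": 2.0, "feature": None},
        "right": {"value": 6.0, "feature": None},
    },
    "right": {"value": 16.0, "feature": None},
}

def path_contributions(node, x):
    """Treeinterpreter-style attribution: the bias is the root mean, and
    each split feature is credited with the change in mean prediction
    along the decision path taken by x."""
    bias = node["value"]
    contribs = {}
    while node["feature"] is not None:
        child = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
        contribs[node["feature"]] = (
            contribs.get(node["feature"], 0.0) + child["value"] - node["value"]
        )
        node = child
    return bias, contribs  # bias + sum(contribs.values()) == leaf prediction

bias, contribs = path_contributions(tree, x=[0.0, 3.0])
# x goes left at the root (feature 0) then right (feature 1), reaching
# the leaf with value 6.0: bias=10.0, contribs={0: -6.0, 1: 2.0}.
```

Unlike SHAP, this scheme credits only the features actually used on the one path taken, which is why it can misattribute importance when features are correlated or interact - the weakness the TreeSHAP paper targets.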

I'm unsure whether we should be wrapping the shap package or not. Things to consider:

  1. If I'm not mistaken, actual C++ implementations of decision paths are in xgboost, catboost, lightgbm now; they support SHAP importances natively - if we're talking about SHAP for tree ensembles, not a general shap algorithm.
  2. It is more work and maintenance to support our own implementation, and to bring it up to the level of the shap package.
  3. There is value in re-implementing algorithms and having alternative implementations.
  4. shap library has a good set of visualisation options, as well as a wide range of SHAP algorithms implemented.

Maybe we can even do both - have our own implementation of the basics when it is easy, e.g. when most of the work is done by the package we're explaining (check the pred_contribs and pred_interactions arguments in XGBoost) - this would let us avoid an external dependency in simple cases. At the same time, I don't think we should be re-implementing the whole shap package, so integration with it for more advanced features (like plotting, or other explanation algorithms) is desired as well.

AshwinB-hat commented 5 years ago

Hey @kmike , thanks for clearing my doubts. I think it makes sense to do both: use our own implementation and integrate the shap package for advanced features. Although I'm curious what edge we might gain by implementing SHAP value computation ourselves; as far as I am aware, the shap library implements the Shapley value calculation as stated in the 2018 TreeSHAP paper.

The further estimations are based on those SHAP values. I'm currently looking at the XGBoost docs for their native implementation, but I am unable to find any example or implementation that does not use the shap library. It would be helpful if you could link some resources.

Also, I have written a mock draft of the GSoC proposal, which is due tomorrow. I would be grateful if you could point out flaws and areas I can improve.

I have only considered wrapping the shap library so far, as I could not find a good reason not to. Your suggestions will be valuable.

Thanks.