kmike opened 6 years ago
I would like to add SHAP support to ELI5. I will start with the small issues to get to know eli5 better. If there is any other information I should have, please let me know.
Hi ELI5 community, @kmike @lopuhin,
I am equally interested in contributing and adding SHAP support for ELI5 through GSOC.
Since I am currently working on issue #276 and will be raising a PR for explain_weights in two days, I just wanted to inquire whether my proposal can include explain_prediction for catboost along with SHAP support?
Hi @AshwinB-hat that's great!
I just wanted to inquire whether my proposal can include explain_prediction for catboost along with SHAP support?
Since SHAP does support Catboost, it would be awesome to have explain_prediction for catboost via SHAP.
Thanks for the prompt reply @lopuhin! That gives me a direction for the application.
@lopuhin @kmike @ivanprado Hey,
Regarding SHAP integration, I have two doubts:
1) Should the format of the feature importances be the same as eli5's (HTML styling and everything), or can it be similar to what the Python shap library already produces? I am trying to weigh the pros and cons of implementing a custom SHAP as opposed to wrapping the existing shap library (which is well documented and optimized).
2) While investigating SHAP performance, I also came across the reason LIME is still used even though it is a subset of SHAP (it is significantly faster for unsupervised models). I was wondering if we could provide an option, maybe an extra parameter, for users to switch between LIME and SHAP predictions. The prediction time difference is large (~1.5 hours in one comparison) with only a slight drop in accuracy. Refer to LIME vs SHAP for details.
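To make the cost comparison above concrete: exact Shapley values average a feature's marginal contribution over every coalition of the other features, which is exponential in the number of features. Below is a minimal pure-Python sketch; the toy model and numbers are invented for illustration and are not from either library.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, n):
    """Exact Shapley values by enumerating all feature coalitions.

    predict(subset) gives the model output when only the features in
    `subset` are 'switched on'.  This needs O(2^n) model calls, which is
    why practical tools (Kernel SHAP, TreeSHAP) rely on sampling or
    model-specific shortcuts instead of brute-force enumeration.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                s = set(subset)
                # marginal contribution of feature i to coalition s
                phi[i] += weight * (predict(s | {i}) - predict(s))
    return phi

# Toy additive model over 3 features: f(S) = sum of active feature values.
x = [1.0, 2.0, 3.0]
phi = shapley_values(lambda s: sum(x[j] for j in s), n=3)
# For an additive model each feature's Shapley value is its own term,
# so phi is approximately [1.0, 2.0, 3.0] (up to float rounding).
```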
Hey @AshwinB-hat, great questions! @kmike is the primary mentor for this project, but from my point of view:
It would be great to provide SHAP explanations using the same unified eli5 interface. E.g. one point of feedback I got was that it's a pain to extract feature importances from SHAP for further processing, whereas eli5 provides an export to a dataframe. So it would be nice to have tight integration here. But some way to obtain the original SHAP visualization also looks useful.
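As an illustration of the "further processing" point: turning raw per-sample SHAP contributions into a tidy importance table takes a few lines of pandas. The array below is a synthetic stand-in for what shap's explainers return, not real model output, and the feature names are made up.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for shap values: one contribution per (sample, feature).
rng = np.random.default_rng(0)
shap_values = rng.normal(size=(100, 3))
feature_names = ["f0", "f1", "f2"]

# The kind of tabular export eli5 aims to provide out of the box:
# mean absolute contribution per feature, sorted by importance.
df = (pd.DataFrame({"feature": feature_names,
                    "weight": np.abs(shap_values).mean(axis=0)})
      .sort_values("weight", ascending=False)
      .reset_index(drop=True))
```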
That's an interesting observation. I observed SHAP to be very fast for XGBoost and LightGBM; maybe a different algorithm is used for them? LIME, on the other hand, is usually quite slow in comparison, in my experience, so it may depend on the models and datasets used. Regarding providing an option for using LIME, I see that more as a separate project of integrating LIME, which is probably quite significant, especially given that we also have a slightly different LIME implementation inside eli5. But if time is left after SHAP integration, it would be interesting to look into this.
1) Yes, that makes sense. Pros: the shap visualisation. Cons: extraction of features. I will look more deeply into the extraction of features.
2) The SHAP algorithm is slower than LIME since it is a superset, but the shap Python library addresses this compute problem by using approximations and optimizations to greatly speed things up while preserving the nice Shapley properties. Except for clustering algorithms, the shap implementation is more efficient than LIME.
As far as I can tell, the second question stands cleared. I will look into optimization of (KNN) for SHAP or a workaround. LIME can be considered a separate project.
Regarding the first question: implementing the SHAP features can be done, but the issue is that we would have to do custom C++ implementations to optimise. I personally think the pain of extracting features will be bearable as opposed to reverse engineering the current shap library and implementing it again. Your suggestions would be of great help here. @lopuhin @kmike @ivanprado
@AshwinB-hat I'll be sure to read up more on the different algorithms SHAP uses, but my idea is that the primary goal is to use the SHAP library implementation mentioned in this section: https://github.com/slundberg/shap#tree-ensemble-example-with-treeexplainer-xgboostlightgbmcatboostscikit-learn-models (fast C++ implementations are supported for XGBoost, LightGBM, CatBoost, and scikit-learn tree models); we didn't think of doing any custom implementation in eli5.
@lopuhin Sure. This doubt had arisen because on the ideas page (gsoc ideas eli5) it was mentioned to either wrap or implement.
I will look into extracting the feature importances from the shap implementation. Thanks
@AshwinB-hat indeed, you are right, I missed this. I would defer this to @kmike to clarify :)
@lopuhin @kmike I have just created the first draft of my proposal. Please let me know if I need to change/improve anything: GSOC proposal 2019. Thanks
Hey @AshwinB-hat!
SHAP is an interesting beast because, on the practical side, it combines both a LIME-like algorithm and a treeinterpreter-like algorithm. They have somewhat different use cases and different performance characteristics.
Currently in eli5, explain_prediction for all decision tree algorithms uses a treeinterpreter-like algorithm. But SHAP is strictly better, so I think one of the goals should be to switch tree ensembles to SHAP by default, using the current API.
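For readers unfamiliar with the treeinterpreter-style algorithm referred to here: it walks a sample's decision path and credits each split's feature with the change in the node's mean prediction. A toy sketch on a hand-built tree follows; the tree structure and values are invented, and this is not eli5's actual implementation.

```python
# Hand-built regression tree: each node stores its mean prediction ("value")
# and, for internal nodes, the feature/threshold it splits on.
tree = {
    "value": 10.0, "feature": 0, "threshold": 5.0,
    "left":  {"value": 4.0, "feature": 1, "threshold": 2.0,
              "left":  {"value": 2.0},
              "right": {"value": 6.0}},
    "right": {"value": 16.0},
}

def tree_contributions(node, x):
    """treeinterpreter-style attribution: the bias is the root mean, and
    each split's feature gets credit for the change in mean prediction
    between the node and the child the sample falls into."""
    bias = node["value"]
    contrib = {}
    while "feature" in node:
        f = node["feature"]
        child = node["left"] if x[f] <= node["threshold"] else node["right"]
        contrib[f] = contrib.get(f, 0.0) + child["value"] - node["value"]
        node = child
    return bias, contrib, node["value"]

bias, contrib, pred = tree_contributions(tree, x=[3.0, 1.0])
# The prediction decomposes exactly: bias + sum of per-feature contributions.
assert abs(bias + sum(contrib.values()) - pred) < 1e-9
```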
I'm unsure whether we should be wrapping the shap package or not. Things to consider:
Maybe we can even do both: have our own implementation of the basics when it is easy, e.g. when most of the work is done by the package we're explaining (check the pred_contribs and pred_interactions arguments in XGBoost); this would allow us to avoid an external dependency in simple cases. At the same time, I don't think we should re-implement the whole shap package, so integration with it for more advanced features (like plotting, or other explanation algorithms) is desired as well.
Hey @kmike, thanks for clearing up my doubts. I think it makes sense to do both: use our own implementation and integrate the shap package for advanced features. Although I'm curious about the edge we might gain by implementing SHAP to produce shap values ourselves. As far as I am aware, the shap library implements the Shapley value calculation as stated in the TreeSHAP paper (2018).
The further estimations are done based on the shap values. I'm currently looking at the XGBoost docs for their native implementation, but I am unable to find any example or implementation that does not use the shap library. It would be helpful if you could link some resources.
Also, I have written a mock draft of the GSOC proposal, which is due tomorrow. I would be grateful if you could point out flaws and areas I can improve on.
I have only considered wrapping the shap library so far, as I could not find a good reason not to. Your suggestions will be valuable.
Thanks.
There is a recent paper which explains how to do explain_prediction for trees and tree ensembles, which the authors claim to be better than treeinterpreter-like measures: https://arxiv.org/pdf/1706.06060.pdf. It is already implemented in LightGBM (https://github.com/Microsoft/LightGBM/pull/825) and XGBoost (https://github.com/dmlc/xgboost/pull/2438). There is also a repo with model-agnostic explanations: https://github.com/slundberg/shap.