Closed jvdboogaard closed 3 years ago
I am very pleased that you are using SAFE in your research. I think adding a SHAP dependence plot is a great idea. I am the developer of the R version of the SAFE framework. Maybe you could open a PR with your changes? Then I can help debug it.
@olagacek @plubon Maybe you are interested in taking care of the implementation?
Hi! Thank you for your fast response! I am glad you agree with this idea. As far as I know there is no implementation of SHAP (DP) in R. The documentation of (general) SHAP can be found here: https://github.com/slundberg/shap , and the SHAP Dependence Plot documentation here: https://slundberg.github.io/shap/notebooks/plots/dependence_plot.html.
Do you think this is enough for you to add it to the Python package easily? If not, how would you recommend I approach this issue?
I'm looking forward to your response.
Kind regards, Jeroen van den Boogaard
From: Alicja Gosiewska @.> Sent: Wednesday, 11 August 2021 15:58 To: ModelOriented/SAFE @.> CC: jvdboogaard @.>; Author @.> Subject: Re: [ModelOriented/SAFE] Alternative for partial dependence plot (#10)
Hello :) I am very happy that you are using SAFE in your research and I think adding SHAP dependence plot is a great idea. Do you know if there is any implementation of SHAP DP in R? I will then be able to easily add it to the rSAFE package.
Hi, sorry, I am a developer of rSAFE, and in my first comment I thought this was about the R version. In SAFE we use PDPs, find changepoints, and then use them for binning continuous variables. I see that SHAP DPs are scatter plots, so I am not sure how to use them in the same way we use PDPs. How would you like to use them?
I ran into the same issue. My idea is to draw a function/line through the scatter plot, taking for example the average of all dots in a 'column'. In this way every column of dots becomes one dot, and these dots can then be connected with a line. What do you think of this?
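The averaging idea above can be sketched in a few lines of plain Python. This is only an illustration and not the code on the SAFE branch; the function name `shap_dependence_profile` and the toy numbers are made up for the example.

```python
from collections import defaultdict


def shap_dependence_profile(feature_values, shap_values):
    """Collapse a SHAP dependence scatter into one averaged point per x value."""
    if len(feature_values) != len(shap_values):
        raise ValueError("feature_values and shap_values must have the same length")
    # Group the SHAP values that share the same feature value (one 'column' of dots).
    columns = defaultdict(list)
    for x, s in zip(feature_values, shap_values):
        columns[x].append(s)
    # Average each column and sort by x so the points can be connected left to right.
    return [(x, sum(vals) / len(vals)) for x, vals in sorted(columns.items())]


# Toy data (hypothetical): three distinct feature values, six observations.
x = [1, 1, 2, 2, 2, 3]
s = [0.25, 0.75, -0.5, 0.0, 0.5, 1.0]
profile = shap_dependence_profile(x, s)
print(profile)  # [(1, 0.5), (2, 0.0), (3, 1.0)]
```

For a truly continuous feature, values rarely repeat exactly, so one would probably bin or round the feature first; the resulting profile could then be handed to the same changepoint search that is used on a PDP curve.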
OK, I've implemented it. Right now it is on a separate branch. You can try it out, and please let me know if it works for you :) https://github.com/ModelOriented/SAFE/tree/feature/shap_dependence_profile
Installation:
pip install git+https://github.com/ModelOriented/SAFE.git@feature/shap_dependence_profile
Example for shap dependence: https://github.com/ModelOriented/SAFE/blob/feature/shap_dependence_profile/examples/SHAP_dependence.ipynb
Wow, that's great, thank you so much. I will let you know as soon as possible. My data set is quite big though, so it might take some time.
Hi Alicja, I am sorry for my late response, unfortunately I was not able to work on my research last weekend.
When I apply the new code to (a small part of) my data, it does seem to transform the features. However, when I try to use the transformed data in a linear regression (using Pipeline), I get an error which I can't get rid of. The error is as follows: "ValueError: Length of passed values is 1, index implies 2."
I have also tried it with your data (apartments.csv) with the same code as your example but with the following added: Which gives (almost) the same error:
Do you have any idea how to fix this?
In addition to this I have another question. Is it possible to also plot the shap dependence plots together with the transformations/changepoints and also plot the categorical variable transformation? In my research I want to have these specific transformations visualized.
Thank you in advance.
Kind regards, Jeroen
Hi Jeroen!
> Do you have any idea how to fix this?
The bug was in the handling of categorical variables, so it was not related to the SHAP dependence transformations. It turned out that the categorical transformations did not work with some versions of Python and pandas. I have added a fix, so it should work for you now. The changes are on the same branch as before: https://github.com/ModelOriented/SAFE/tree/feature/shap_dependence_profile. You just need to reinstall the package.
> In addition to this I have another question. Is it possible to also plot the shap dependence plots together with the transformations/changepoints and also plot the categorical variable transformation? In my research I want to have these specific transformations visualized.
There are no visualizations implemented in SAFE, and to my knowledge there are currently no plans to add any. So the fastest way is to generate the plots yourself.
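One way to generate such a plot yourself is sketched below with matplotlib. It assumes you already have the feature values, their SHAP values, and a list of changepoints as plain sequences; how to extract the changepoints from a fitted SAFE transformer is not covered in this thread, so `changepoints` is simply a hypothetical input here, and the data is made up.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt


def plot_dependence_with_changepoints(feature_values, shap_values,
                                      changepoints, feature_name="feature"):
    """Scatter SHAP values against a feature and mark changepoints as vertical lines."""
    fig, ax = plt.subplots()
    ax.scatter(feature_values, shap_values, s=10, alpha=0.5)
    for cp in changepoints:
        # Each changepoint is a bin boundary for the continuous feature.
        ax.axvline(cp, color="red", linestyle="--")
    ax.set_xlabel(feature_name)
    ax.set_ylabel("SHAP value")
    return fig, ax


# Made-up data with one assumed changepoint at x = 3.5.
x = [1, 2, 3, 4, 5, 6]
s = [-0.4, -0.35, -0.3, 0.2, 0.3, 0.35]
fig, ax = plot_dependence_with_changepoints(x, s, changepoints=[3.5], feature_name="x")
```

Calling `fig.savefig("shap_dependence.png")` afterwards would write the figure to disk. A categorical transformation could be shown in a similar way, for example as a bar chart of the mean SHAP value per merged level.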
Hi Alicja,
Thank you very much for fixing the issue! It seems to work properly now on the small test data set. I will apply it to the whole data set to investigate the increase in performance. I will definitely keep you updated! But I think it will take some time.
Cool! Good luck with your master's thesis!
Hi Alicja,
Thanks again for implementing the adjustments to the code. I have run the code with both shap and pdp as the dependence_method argument, and obtained better performance with the shap dependence method when using XGBoost as the surrogate model: an improvement in the Gini coefficient of approximately 8%. If you would like to hear more about this, please let me know!
However, I encountered a new problem. When I run the code with shap as the dependence method and a random forest as the surrogate model, the code seems to run forever. When I run it with pdp as the dependence method, the program finishes within half an hour. As extra information: when I run the code with dependence_method=shap and surrogate_model=XGBoostclassifier, it also takes about half an hour. So something goes wrong when dependence_method=shap is combined with surrogate_model=RandomForestClassifier().
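The thread does not say how the Gini coefficient was computed. A common convention for classifiers, assumed here, is Gini = 2 * AUC - 1. The sketch below computes it in plain Python on made-up labels and scores; it is only meant to make the metric concrete and is not taken from the thesis code.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # A positive scored above a negative counts as 1, a tie counts as 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini(labels, scores):
    """Gini coefficient under the assumed convention Gini = 2 * AUC - 1."""
    return 2 * auc(labels, scores) - 1


# Made-up example: 3 of the 4 positive/negative pairs are ranked correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores), gini(labels, scores))  # 0.75 0.5
```

The pairwise loop is quadratic in the number of rows, so for a large data set a library routine for AUC would be the practical choice.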
I am looking forward to your response.
Wkr, Jeroen
Hi Jeroen!
Great to hear that you were able to improve the performance! I am very interested in hearing more about your experiments.
The reason for the time-consuming computation is that random forests typically have much deeper trees than XGBoost models. (See these issues in the shap repository: https://github.com/slundberg/shap/issues/14#issuecomment-357755171, https://github.com/slundberg/shap/issues/1993).
Unfortunately, I do not think that I can do anything about it from the SAFE level.
Best wishes Alicja
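To get a feeling for why tree depth matters so much: the Lundberg et al. paper cited in this thread gives the cost of TreeSHAP as O(T * L * D^2) per explained row, with T trees, at most L leaves per tree, and maximum depth D. The sketch below just evaluates that bound; the model sizes are hypothetical and chosen only to illustrate the scaling, not measured on any real model.

```python
def treeshap_cost(n_trees, max_leaves, max_depth):
    """Rough operation count per explained row, following the O(T * L * D^2) bound."""
    return n_trees * max_leaves * max_depth ** 2


# Hypothetical sizes: shallow boosted trees versus deep, fully grown forest trees.
boosted = treeshap_cost(n_trees=100, max_leaves=64, max_depth=6)
forest = treeshap_cost(n_trees=100, max_leaves=5000, max_depth=20)
print(boosted, forest, round(forest / boosted))
```

With these assumed sizes the forest comes out several hundred times more expensive per row. Limiting the depth of the random forest, or computing SHAP values on a subsample of rows, are the usual ways to bring this down, at the cost of a different or noisier surrogate.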
Hi Alicja,
Thanks for your fast response! I can send you my research when I am finished, which will probably be mid-October. If you would be interested in receiving it, please send me your email address so I can send it when I'm done.
It is a pity that this is so time-consuming, but then I will just have to deal with that. Thanks for the answer anyways!
Wkr, Jeroen
Dear Gosiewska,
I am currently using your SAFE code together with your SAFE ML paper as part of my master's thesis for my Econometrics master's in Quantitative Finance (in the Netherlands). I love your approach and I am very curious about the results in my research. However, there is a small adjustment that I would like to make to the code. Namely, I want to use the SHAP dependence plot as an alternative to the partial dependence plot (inspired by the paper of Lundberg: Lundberg, Scott M., Gabriel G. Erion, and Su-In Lee. "Consistent individualized feature attribution for tree ensembles." arXiv preprint arXiv:1802.03888 (2018).). Lundberg states that the SHAP values are the only consistent feature attribution values, such that the SHAP dependence plot is a rich alternative to the partial dependence plot.
However, I am not exactly an expert in programming, so I have failed so far to adjust this part of the code. Please let me know your thoughts on this adjustment and whether you would like to implement it.
With kind regards,
Jeroen van den Boogaard