ModelOriented / SAFE

Surrogate Assisted Feature Extraction
MIT License

Alternative for partial dependence plot #10

Closed jvdboogaard closed 3 years ago

jvdboogaard commented 3 years ago

Dear Gosiewska,

I am currently using your SAFE code together with your SAFE ML paper as part of my master's thesis for my Econometrics master's in Quantitative Finance (in the Netherlands). I love your approach and I am very curious about the results in my research. However, there is a small adjustment that I would like to make to the code. Namely, I want to use the SHAP dependence plot as an alternative to the partial dependence plot (inspired by the paper of Lundberg: Lundberg, Scott M., Gabriel G. Erion, and Su-In Lee. "Consistent individualized feature attribution for tree ensembles." arXiv preprint arXiv:1802.03888 (2018)). Lundberg states that SHAP values are the only consistent feature attribution values, which makes the SHAP dependence plot a rich alternative to the partial dependence plot.

However, I am not an expert in programming, so I have so far failed to adjust this part of the code. Please let me know your thoughts on this adjustment and whether you would like to implement it.

With kind regards,

Jeroen van den Boogaard

agosiewska commented 3 years ago

I am very pleased that you are using SAFE in your research. I think adding a SHAP dependence plot is a great idea. I am the developer of the R version of the SAFE framework. Maybe you could open a PR with your changes? Then I can help to debug it.

@olagacek @plubon Maybe you are interested in taking care of the implementation?

jvdboogaard commented 3 years ago

Hi! Thank you for your fast response! I am glad you agree with this idea. As far as I know there is no implementation of SHAP (DP) in R. The documentation of (general) SHAP can be found here: https://github.com/slundberg/shap , and the SHAP Dependence Plot documentation here: https://slundberg.github.io/shap/notebooks/plots/dependence_plot.html.

Do you think that with this you will be able to easily add it to the Python package? If not, how would you recommend I approach this issue?

I'm looking forward to your response.

Kind regards, Jeroen van den Boogaard


From: Alicja Gosiewska. Sent: Wednesday, 11 August 2021, 15:58. To: ModelOriented/SAFE. CC: jvdboogaard; Author. Subject: Re: [ModelOriented/SAFE] Alternative for partial dependence plot (#10)

Hello :) I am very happy that you are using SAFE in your research and I think adding SHAP dependence plot is a great idea. Do you know if there is any implementation of SHAP DP in R? I will then be able to easily add it to the rSAFE package.


agosiewska commented 3 years ago

Hi, sorry, I am a developer of rSAFE, and in my first comment I thought this was about the R version. In SAFE we use PDP plots, find changepoints, and then use them for binning continuous variables. I see that SHAP DPs are scatter plots, so I am not sure how to use them in the same way we use PDPs. How would you like to use them?
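To make the binning step concrete, here is a minimal sketch of how a continuous variable could be discretized once changepoints are known. The function name and the changepoint values are hypothetical, and this is not SAFE's actual implementation; it only illustrates the idea of turning changepoints into interval labels.

```python
from bisect import bisect_right


def bin_by_changepoints(values, changepoints):
    """Map each value to the index of the interval it falls into.

    With changepoints [c1, c2], values below c1 get 0, values in
    [c1, c2) get 1, and values at or above c2 get 2.
    """
    cuts = sorted(changepoints)
    return [bisect_right(cuts, v) for v in values]


# Hypothetical feature values and changepoints, for illustration only.
surface = [25.0, 48.0, 75.5, 120.0]
bins = bin_by_changepoints(surface, [40.0, 90.0])
print(bins)
```

The resulting interval indices can then be treated as a categorical feature, for example by one-hot encoding them.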

jvdboogaard commented 3 years ago

I ran into the same issue. My idea is to draw a function/line through the scatter plot, where we take, for example, the average of all dots in a 'column'. In this way every column of dots becomes one dot, and these dots can then be connected with a line. What do you think of this?
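The averaging idea could be sketched roughly as follows. This is only an illustration with made-up numbers, not the code on the SAFE branch; the function name is hypothetical, and it assumes the SHAP values for one feature have already been computed.

```python
from collections import defaultdict
from statistics import mean


def shap_dependence_profile(feature_values, shap_values):
    """Average the SHAP values of all points that share a feature value.

    Returns (x, mean_shap) pairs sorted by x, i.e. one dot per 'column'
    of the SHAP dependence scatter plot.
    """
    columns = defaultdict(list)
    for x, s in zip(feature_values, shap_values):
        columns[x].append(s)
    return [(x, mean(columns[x])) for x in sorted(columns)]


# Made-up feature values and SHAP values, for illustration only.
x = [1, 1, 2, 2, 2, 3]
s = [0.25, 0.75, -0.5, 0.0, 0.5, 1.0]
profile = shap_dependence_profile(x, s)
print(profile)
```

For a truly continuous feature, few points share exactly the same value, so in practice the values would probably need to be rounded or grouped into narrow intervals before averaging.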

agosiewska commented 3 years ago

Ok, I've implemented it. Right now it is on a separate branch. You can try it out. And please, let me know, if it works for you :) https://github.com/ModelOriented/SAFE/tree/feature/shap_dependence_profile

Installation:

pip install git+https://github.com/ModelOriented/SAFE.git@feature/shap_dependence_profile

Example for shap dependence: https://github.com/ModelOriented/SAFE/blob/feature/shap_dependence_profile/examples/SHAP_dependence.ipynb

jvdboogaard commented 3 years ago

Wow, that's great, thank you so much. I will let you know as soon as possible. My data set is quite big though, so it might take some time.

jvdboogaard commented 3 years ago

Hi Alicja, I am sorry for my late response, unfortunately I was not able to work on my research last weekend.

When I apply the new code to (a small part of) my data, it does indeed seem to transform the features. However, when I try to use the transformed data in a linear regression (using Pipeline), I get an error that I can't get rid of. The error is as follows: "ValueError: Length of passed values is 1, index implies 2."

I have also tried it with your data (apartments.csv), using the same code as your example but with the following added: [image] This gives (almost) the same error: [image] [image]

Do you have any idea how to fix this?

In addition to this I have another question. Is it possible to also plot the shap dependence plots together with the transformations/changepoints and also plot the categorical variable transformation? In my research I want to have these specific transformations visualized.

Thank you in advance.

Kind regards, Jeroen

agosiewska commented 3 years ago

Hi Jeroen!

> Do you have any idea how to fix this?

The bug affected categorical variables, so it was not related to the SHAP dependence transformations. It turned out that the categorical transformations did not work with some versions of Python and pandas. I have added a fix, so it should work for you now. The changes are on the same branch as before: https://github.com/ModelOriented/SAFE/tree/feature/shap_dependence_profile. You just need to reinstall the package.

> In addition to this I have another question. Is it possible to also plot the shap dependence plots together with the transformations/changepoints and also plot the categorical variable transformation? In my research I want to have these specific transformations visualized.

There are no visualizations implemented in SAFE, and to my knowledge, there are no plans to do so now. So the fastest way is to generate the plots yourself.

jvdboogaard commented 3 years ago

Hi Alicja,

Thank you very much for fixing the issue! It seems to work properly now on the small test data set. I will apply it to the whole data set to investigate the increase in performance. I will definitely keep you updated! But I think it will take some time.

agosiewska commented 3 years ago

Cool! Good luck with your master thesis!

jvdboogaard commented 3 years ago

Hi Alicja,

Thanks again for implementing the adjustments to the code. I have run the code with both shap and pdp as the dependence_method argument, and obtained better performance with the shap dependence method when using XGBoost as the surrogate model. In particular, the Gini coefficient improved by approximately 8%. If you would like to hear more about this, please let me know!

However, I encountered a new problem. When I run the code with shap as the dependence method and a random forest as the surrogate model, the code seems to run indefinitely. When I run the code with pdp as the dependence method, the program finishes within half an hour. As extra information: when I run the code with dependence_method=shap and surrogate_model=XGBoostclassifier, it also takes about half an hour to run. So something goes wrong when dependence_method=shap is combined with surrogate_model=RandomForestClassifier().

I am looking forward to your response.

Wkr, Jeroen

agosiewska commented 3 years ago

Hi Jeroen!

Great to hear that you were able to improve the performance! I am very interested in hearing more about your experiments.

The reason for time-consuming computations is that random forest has much deeper trees than xgboost. (See issues in shap repository: https://github.com/slundberg/shap/issues/14#issuecomment-357755171, https://github.com/slundberg/shap/issues/1993).
Unfortunately, I do not think that I can do anything about it from the SAFE level.
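To give a feeling for why tree depth matters so much, the Tree SHAP algorithm from the Lundberg et al. paper cited at the top of this thread has a complexity of O(T * L * D^2) per explained row, with T trees, at most L leaves per tree, and maximum depth D. The sketch below plugs in hypothetical model sizes (they are assumptions, not measurements from this issue) to compare a shallow boosted ensemble with a deep random forest.

```python
def tree_shap_cost(n_trees, max_leaves, max_depth):
    """Rough operation count per explained row, following O(T * L * D^2)."""
    return n_trees * max_leaves * max_depth ** 2


# Hypothetical model sizes, chosen only for illustration:
# shallow boosted trees versus fully grown random forest trees.
boosted = tree_shap_cost(n_trees=100, max_leaves=64, max_depth=6)
forest = tree_shap_cost(n_trees=100, max_leaves=10_000, max_depth=20)

print(f"boosted trees:  {boosted:,} operations per row")
print(f"random forest:  {forest:,} operations per row")
print(f"forest / boosted cost ratio: {forest / boosted:.0f}")
```

Under these assumed sizes the forest is roughly three orders of magnitude more expensive to explain, which is consistent with one run finishing in half an hour and the other not finishing in a reasonable time.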

Best wishes Alicja

jvdboogaard commented 3 years ago

Hi Alicja,

Thanks for your fast response! I can send you my research when I am finished, which will probably be mid-October. If you are interested in receiving it, please send me your email address so I can send it when I'm done.

It is a pity that this is so time-consuming, but then I will just have to deal with that. Thanks for the answer anyways!

Wkr, Jeroen