NorskRegnesentral / shapr

Explaining the output of machine learning models with more accurately estimated Shapley values
https://norskregnesentral.github.io/shapr/
Other
138 stars 32 forks source link

Want to contribute - Latest updated source? #318

Open jonlachmann opened 1 year ago

jonlachmann commented 1 year ago

Hi!

I am interested in adding support for some new model classes, and saw the text in the readme saying that there are a lot of ongoing refactoring. I was just wondering if there is a branch which it is preferred to develop against, to avoid having to refactor more than necessary once the other code re-factorization is finished. Not really sure if this fits as an issue, but I figured it was the least inappropriate place to put this question.

Anyway, thanks for a well written R package!

martinju commented 1 year ago

Hi, and thanks for reaching out and your willingness to contribute.

If you want to contribute now, you should do so to the branch named "devel". The plan is to make that branch the master within this year.

Currently, the changes are not well documented, but at least the basic readme example and most of the vignette is up to date in the new branch. For your particular case of adding new model classes, I now put each model class in a new file, see e.g. https://github.com/NorskRegnesentral/shapr/blob/devel/R/model_ranger.R

jonlachmann commented 1 year ago

Hi, thanks for the clarification of which branch to work against!

My interest is working with time series models, to explain the predictions at each step in the future. For now I have managed to do so by running the explanation function one time for each future value. Do you have any suggestions on how to pass more complex x_explain values? What I would want to do is include a "steps ahead" parameter to the predict function, which would allow me to have say 3 explanations, one for each step in the future, using the same x_explain vector. The way that I solved it for now was to set this parameter on the model before passing it to the explain functions, but I think it is a big inefficient having to run the function one time for each step.

What would be even more neat is to be able to have the predict function return a vector, which would then contain multiple values which are all explained separately. It more or less implies supporting models with multiple outputs, which I think is a quite general case which would add a lot of usability to the package. Do you see any general problems with this train of thought?

martinju commented 1 year ago

Interesting application! Currently, the only way to achieve this is like you did, which is clearly inefficient. Technically, I think it would be problematic to support multiple output and I am not sure the usefulness trumps the technical burden of direct support of multiple output prediction models. Please convince me if you disagree :-)

I see two possible ways to by-pass the problem in your (and similar) cases:

  1. Allow passing of additional arguments to the prediction function on a per-row-in-x_explain basis.
  2. Passing the output of explain() as an argument to a new call of explain() with a new set x_explain, reusing different parts of the previous explain-object. [note that the new version of the package merges what was previously in the shapr() into explain().] This way you can explain your lag-predictions iteratively like you are doing today without the efficiency loss. This approach would also allow reducing the uncertainty of existing explanation by adding additional samples in the monte carlo integration of v(S), or subsets S in kernelSHAP approximation, should that be desired.

The plan is to support point 2, although I have not decided exactly how to do it yet, and it will not be ready for the release this year. Which solution would you prefer?

PS: In the devel version we have implemented a new approach for time series prediction explanation (see approach='timeseries' and an example in the new vignette).

jonlachmann commented 1 year ago

Yes, there are a lot of new things to take into consideration when explaining a forecast, like which value should be used as prediction_zero for example. I have a basic test case for a VAR model working as of now, just trying out things to see what gives results which makes sense, I will try to create a gist or something when it is in a more presentable form, to allow for more specific discussion if that would be of interest.

I will have to look into the code for the different approaches to see how it could be possible to do multiple outputs, it may very well be that you are correct that it is quite hard to get it working.

If point 1 would be supported, solving the multiple steps would be as easy as giving the step to forecast as an additional argument to the observations. I am however not sure if this would aid computational performance.

I will try to investigate the code a bit closer to be able to give some more informed answers! I will also look at the timeseries approach, is it ready to be used or under development?

martinju commented 1 year ago

Regarding the multiple output, I am mostly worried about the bookkeeping. I am happy to discuss solutions if you give it a go.

From my perspective point 1 and 2 would both be as computationally efficient as supporting multiple output directly, it would just be slightly less user friendly.

The timeseries approach should be ready for use. It is the simple method introduced in section 4.3 in this paper: https://martinjullum.com/publication/jullum-2021-efficient/jullum-2021-efficient.pdf

You may also be interested in a recent paper which discusses similar (and more advanced?) methods for the same use case: https://arxiv.org/ftp/arxiv/papers/2211/2211.06507.pdf (I have not read it properly myself.)

jonlachmann commented 1 year ago

Thanks for the links to the papers, will have a read when I get time. I have started some work on the code, made a small PR which I am not completely sure if it is useful, but I thought that I should start somewhere. Let me know if you have any comments.

jonlachmann commented 1 year ago

I did manage to implement multiple outputs, and also got to know the package and methods a bit. If you have a look here https://github.com/jonlachmann/shapr/tree/output_size, you can maybe see if I managed to do something stupid or if it seems reasonable.

martinju commented 1 year ago

Great! I just had a quick glance and couldn't spot any obvious issues. Could you upload a basic example script as well?

jonlachmann commented 1 year ago

Great to hear, I am not really used to working with data.table, so I learned a bit while getting it to work. I usually always just let everything be a matrix, but can see the benefits in this case where more complicated selections and aggregations are needed.

I formatted and commented my test example and created a gist here https://gist.github.com/jonlachmann/47bcfc4b1a072debd7278547992cd1a6, it should work given that you check out my branch and run it from inside that folder. Let me know if you get it to run or have any questions.