christophM / interpretable-ml-book

Book about interpretable machine learning
https://christophm.github.io/interpretable-ml-book/

Monte-Carlo Shapley algorithm #202

Closed: diegobatt closed this issue 2 years ago

diegobatt commented 4 years ago

The algorithm for the Monte-Carlo estimation of the Shapley value is described as follows in Chapter 5, Section 9:

[image: screenshot of the algorithm description from the book]

I believe the description is not clear about how the random permutation o is used. It asks to order the instances into x_o and z_o, which I assume are the x and z instances, respectively, sorted by the random order o. However, the equation looks more like an instance sorted according to the original feature order: the first element is 1, the last one is p, and j sits right in the middle, just as in the original order. On top of that, x_{+j} is fed to the model f in this same order. So is this order the original one, or the one that results from the random permutation?

My understanding is that this should be more like:

I might be getting the whole Monte-Carlo idea wrong here, but I believe the description of the algorithm maps more closely to something like the following:

For each iteration, a random instance z is selected from the data and a random order of the features is generated. Two new instances are created by combining values from the instance of interest x and the sample z. The instance x_{+j} is the instance of interest, but all values in the order before feature j are replaced by feature values from the sample z.
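For concreteness, here is one way to write that construction feature-by-feature (just a sketch following the wording above; whether the z-values are taken before or after feature j in the order is only a convention, as long as $x_{+j}$ and $x_{-j}$ differ in feature j alone):

$$
(x_{+j})_i =
\begin{cases}
z_i & \text{if feature } i \text{ comes before } j \text{ in the order } o \\
x_i & \text{otherwise (including } i = j\text{)}
\end{cases}
\qquad
(x_{-j})_i =
\begin{cases}
z_i & \text{if } i \text{ comes before } j \text{ in } o\text{, or } i = j \\
x_i & \text{otherwise}
\end{cases}
$$

The single-sample contribution is then $f(x_{+j}) - f(x_{-j})$, averaged over many draws of z and o.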

christophM commented 3 years ago

Hi there, thanks a lot for the issue and PR. I see the point you are making.

A solution might be to have in the index o(1) instead of (1). So for example $x_{+j} = (x_{o(1)}, x_{o(2)}, \ldots)$, and also keep the sentence you added in line 321. I am not a fan of introducing another letter y, though.

What do you think?

diegobatt commented 3 years ago

Hi @christophM! Thank you for the feedback. Shouldn't we discuss this in the PR threads instead of the issue?

Adding o to the index would solve:

    - Order instance x: $x_o=(x_{o(1)},\ldots,x_{o(j)},\ldots,x_{o(p)})$
    - Order instance z: $z_o=(z_{o(1)},\ldots,z_{o(j)},\ldots,z_{o(p)})$

but if I understood correctly, when building x_{+j} and x_{-j} you will need to mix the original ordering with the new one. The new instances have to be in the original ordering for evaluating them, but the features that are replaced with z are those to the right of feature j in the ordering o (i.e. all the features i with o(i) > o(j)). That mixing is the part I struggled with the most. How would you state that? I'm 100% on board with not adding y if possible.
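To make that mixing concrete, a made-up example: with p = 4 features, feature of interest j = 2, and a random order that places the features as (3, 1, 2, 4), we have o(3)=1, o(1)=2, o(2)=3, o(4)=4. The only feature with o(i) > o(2) is feature 4, so, written in the original feature order, $x_{+2} = (x_1, x_2, x_3, z_4)$ and $x_{-2} = (x_1, z_2, x_3, z_4)$; the two instances differ only in feature 2.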

diegobatt commented 3 years ago

Maybe we could leverage the fact that all this ordering ends up being equivalent to using x or z with 1/2 probability for each feature in the new instances?

christophM commented 2 years ago

I guess we can discuss it here, as I am still trying to figure out how this is best described.

I am not sure I fully understood what should be improved, but could it be that the following needs to be explained in the book? The ordering is merely a trick to create two new data instances, where some parts come from x and some from z. In the actual function call, in the code, we do not actually use the ordering: we have to leave the features in the same order as before, because the prediction function usually expects the features to always be ordered in the same way. The ordering only decides, for each feature value, whether it comes from x or z. Do you think that explanation would make it easier to understand the Shapley algorithm? And does it address your feedback?
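A minimal sketch of that idea in code (a hypothetical helper, not the book's or any package's implementation; f, x, z, and j are placeholders) could look like this:

```python
import numpy as np

def one_shapley_sample(f, x, z, j, rng):
    """One Monte-Carlo sample of feature j's contribution for instance x.

    f   : prediction function that expects a 1-D feature vector in the ORIGINAL order
    x   : instance of interest (1-D array, original feature order)
    z   : randomly drawn instance from the data (same order)
    j   : index of the feature of interest
    rng : numpy random Generator
    """
    p = len(x)
    order = rng.permutation(p)          # random permutation of the feature indices
    pos = np.argsort(order)             # pos[i] = position of feature i in that order

    # The permutation only decides, per feature, whether the value comes from x or z.
    # The features themselves stay in their original positions when calling f.
    from_z = pos > pos[j]               # features ordered after j take their value from z
    x_plus_j = np.where(from_z, z, x)   # feature j itself still comes from x
    x_minus_j = x_plus_j.copy()
    x_minus_j[j] = z[j]                 # ...and comes from z in the second instance

    return f(x_plus_j) - f(x_minus_j)
```

Averaging this difference over many draws of z and of the permutation gives the estimate of $\phi_j$; note that x_plus_j and x_minus_j are never reordered before being passed to f.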

diegobatt commented 2 years ago

Exactly, that explanation would address the feedback.

About the thing I said here:

Maybe we could leverage the fact that all this ordering ends up being equivalent to using x or z with 1/2 probability for each feature in the new instances?

I think that all the ordering and mixing is, in the end, equivalent to just building an instance that takes each feature i != j from z or x with 1/2 probability. As the ordering is random, all orderings have the same probability, so half of them have feature i to the right of j and the other half to the left.

Going for this approach might be the easiest way to explain it.