Merge three MCMC Shapley algorithms.

kosmitive commented 10 months ago

There might now be room to merge the algorithms. We should keep:

[ ] Sampler pattern (from Beta-Shapley; it makes reproducibility easier, and permutation sampling is covered)
[ ] Executor pattern (first used in TMC-Shapley, but also in CS-Shapley and Beta-Shapley)
[ ] Weight pattern (Uniform for Classwise-Shapley and TMC-Shapley)
[ ] Context pattern (Complement set in Classwise-Shapley)
[ ] ???? History pattern

Do you have more points for the unification of these algorithms?

mdbenito commented 9 months ago

I think we are mixing categories here.

What you call "sampler pattern" applies to all semivalues. The core uses futures.
TMC, CS and all semivalues use futures, so there is slight code duplication with respect to semivalues. TMC-Shapley runs full permutations in the workers, whereas semivalues send one sample at a time; batching can make the two equivalent, thanks to your PR. Another difference that warrants keeping TMC-Shapley as a separate routine: with the cache disabled, the general implementation doubles the computation relative to TMC. TMC stores the previous utility in a temporary variable in the inner loop, while semivalues recompute it at each step and therefore rely on caching.
I don't understand what you mean by "weight pattern". TMC-Shapley is just a semivalue, so it is subsumed into semivalues, except for the caching issue. The uniform sampling in CS-Shapley can be changed by using a sampler, as in the semivalues.
Computing "conditional" utilities in CS-Shapley is specific to that method. I'd leave it at that unless and until we find a good reason to generalise because of a new method.
See my comment in #416
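
To make the caching point above concrete, here is a minimal sketch (not pyDVL code; `CountingUtility`, `marginals_running` and `marginals_per_sample` are hypothetical names, and truncation is left out). It counts uncached utility evaluations for one permutation under the two strategies described: keeping the previous utility in a temporary, versus evaluating each sample independently.

```python
from typing import Dict, FrozenSet, Sequence


class CountingUtility:
    """Toy utility that counts its evaluations (stands in for an uncached model fit)."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, subset: FrozenSet[int]) -> float:
        self.calls += 1
        return len(subset) ** 0.5  # arbitrary toy score, an assumption


def marginals_running(u, permutation: Sequence[int]) -> Dict[int, float]:
    """TMC-style inner loop: reuse the previous utility via a temporary."""
    out = {}
    prev = u(frozenset())
    for k, i in enumerate(permutation):
        curr = u(frozenset(permutation[: k + 1]))
        out[i] = curr - prev
        prev = curr  # the next step reuses this value instead of recomputing it
    return out


def marginals_per_sample(u, permutation: Sequence[int]) -> Dict[int, float]:
    """Generic style: every (index, subset) sample is evaluated on its own."""
    out = {}
    for k, i in enumerate(permutation):
        s = frozenset(permutation[:k])
        out[i] = u(s | {i}) - u(s)  # two evaluations per sample without a cache
    return out


perm = [3, 0, 4, 1, 2]
u_running, u_sample = CountingUtility(), CountingUtility()
same = marginals_running(u_running, perm) == marginals_per_sample(u_sample, perm)
print(same, u_running.calls, u_sample.calls)  # True 6 10
```

For a permutation of length n the first variant needs n + 1 evaluations and the second 2n, which is the near-doubling mentioned above; a cache would absorb the repeated subsets in the second variant.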

So, what can we do?

[ ] Fix the semivalue computation to work with permutation sampling without requiring caching (low prio, since caching is trivial to enable, and we might have a simple implementation not relying on memcached soon)
[ ] Allow other samplers in CS-Shapley (you tell me how hard this would be). The biggest benefit would be if we ever implement good variance-reduced samplers.
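
As a rough illustration of what "allowing other samplers" could look like, the sketch below passes the sampler to a plain permutation-based Shapley estimator as an argument. It is not the pyDVL API and it does not implement CS-Shapley: all names are hypothetical, the utility is a toy, and antithetic sampling (each permutation followed by its reverse) is swapped in only as one example of a variance-reduction scheme.

```python
import random
from typing import Callable, Dict, FrozenSet, Iterator, List, Sequence

Utility = Callable[[FrozenSet[int]], float]


def uniform_permutations(indices: Sequence[int], seed: int) -> Iterator[List[int]]:
    """Endless stream of uniformly random permutations, reproducible via the seed."""
    rng = random.Random(seed)  # private RNG, so runs with the same seed coincide
    while True:
        perm = list(indices)
        rng.shuffle(perm)
        yield perm


def antithetic_permutations(indices: Sequence[int], seed: int) -> Iterator[List[int]]:
    """Each uniform permutation is followed by its reverse."""
    for perm in uniform_permutations(indices, seed):
        yield perm
        yield perm[::-1]


def permutation_shapley(
    u: Utility,
    indices: Sequence[int],
    sampler: Iterator[List[int]],
    n_permutations: int,
) -> Dict[int, float]:
    """Average marginal contributions over permutations drawn from any sampler."""
    totals = {i: 0.0 for i in indices}
    for _, perm in zip(range(n_permutations), sampler):
        prev = u(frozenset())
        for k, i in enumerate(perm):
            curr = u(frozenset(perm[: k + 1]))
            totals[i] += curr - prev
            prev = curr
    return {i: t / n_permutations for i, t in totals.items()}


def toy_utility(subset: FrozenSet[int]) -> float:
    """Arbitrary non-additive toy utility (an assumption for illustration)."""
    return float(sum(subset)) ** 2
```

Calling `permutation_shapley(toy_utility, [1, 2, 3, 4], uniform_permutations([1, 2, 3, 4], seed=0), 50)` twice gives identical results, which is the reproducibility benefit mentioned earlier in the thread; swapping in `antithetic_permutations` changes only the sampler argument.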

kosmitive commented 9 months ago

Using a sampler is indeed a very nice pattern. It actually reduces the reproducibility test cases drastically.
Agree
Yes, implement TMC and CS-Shapley in terms of semivalues (with uniform weights), although we need to generalize the concept of semivalues for CS-Shapley. I actually had an idea for generalizing CS-Shapley and semivalues, but let's talk about that in the next meeting.
See above, we could create a whole new class of algorithms applicable to classification and regression problems unifying CS-Shapley and Semivalues.

[ ] For using other samplers in CS-Shapley, we would need to refactor the algorithm. It should be possible by using the combinatorial definition.
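
For reference, the combinatorial definition of a plain semivalue (without the in-class and out-of-class structure of CS-Shapley) can be sketched as below. This is an exact enumeration over all subsets, so it only suits tiny index sets; `semivalue`, `shapley_weight` and `banzhaf_weight` are hypothetical names, not pyDVL functions. The coefficient w(k) multiplies each marginal contribution to a subset of size k, and the Shapley value is the special case w(k) = 1 / (n * C(n-1, k)).

```python
from itertools import combinations
from math import comb
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_weight(n: int, k: int) -> float:
    """Shapley coefficient for subsets of size k out of n - 1 other points."""
    return 1.0 / (n * comb(n - 1, k))


def banzhaf_weight(n: int, k: int) -> float:
    """Banzhaf coefficient: every subset of the other points weighs the same."""
    return 1.0 / 2 ** (n - 1)


def semivalue(
    u: Callable[[FrozenSet[int]], float],
    indices: Sequence[int],
    weight: Callable[[int, int], float],
) -> Dict[int, float]:
    """Exact semivalue: weighted sum of marginals over all subsets without i."""
    n = len(indices)
    values = {}
    for i in indices:
        rest = [j for j in indices if j != i]
        total = 0.0
        for k in range(n):  # subset sizes 0 .. n-1
            w = weight(n, k)
            for combo in combinations(rest, k):
                s = frozenset(combo)
                total += w * (u(s | {i}) - u(s))
        values[i] = total
    return values
```

With the Shapley coefficient the values sum to u(D) - u(∅) (efficiency), which makes a convenient sanity check on a toy utility. A Monte Carlo variant would replace the two inner loops with draws from a sampler.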

mdbenito commented 9 months ago

So, to recap, for the meeting:

Agree

=> We leave TMCS as-is until we have always-on in-memory caching for single-node parallelism. If users can set up a cluster, they can set up memcached.

Yes, implement TMC and CS-Shapley in terms of semivalues (with uniform weights), although we need to generalize the concept of semivalues for CS-Shapley. I actually had an idea for generalizing CS-Shapley and semivalues, but let's talk about that in the next meeting.

ok

See above, we could create a whole new class of algorithms applicable to classification and regression problems unifying CS-Shapley and Semivalues.

All methods except CWS work with any supervised model. For CWS how would you translate the in-class and out-of-class concepts? With distance? This could use a configurable kernel...

For using other samplers in CS-Shapley, we would need to refactor the algorithm. It should be possible by using the combinatorial definition.

What about using permutations of the in-class and out-of-class subsets?

kosmitive commented 9 months ago

=> We leave TMCS as-is until we have always-on in-memory caching for single-node parallelism. If users can set up a cluster, they can set up memcached.

Sounds good.

All methods except CWS work with any supervised model. For CWS how would you translate the in-class and out-of-class concepts? With distance? This could use a configurable kernel...

By defining a neighborhood for each point i, which could be based on the labels or the features.

What about using permutations of the in-class and out-of-class subsets?

The code needs to be adapted, but it should be possible. I have to verify it, especially for the out-of-class sets, to be 100% sure.

aai-institute / pyDVL #424