aai-institute / pyDVL

pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
https://pydvl.org
GNU Lesser General Public License v3.0

Merge three MCMC Shapley algorithms. #424

Open kosmitive opened 10 months ago

kosmitive commented 10 months ago

At this time there might be room to merge the algorithms. We should keep

Do you have more points for the unification of these algorithms?

mdbenito commented 9 months ago

I think we are mixing categories here.

So, what can we do?

kosmitive commented 9 months ago

mdbenito commented 9 months ago

So, to recap, for the meeting:

  • Agree

=> We leave TMCS as-is until we have always-on in-memory caching for single-node parallelism. If users can set up a cluster, they can set up memcached.

  • Yes, implement TMC and CS-Shapley in terms of semivalues (with uniform weights), although we need to generalize the concept of semivalues for CS-Shapley. I actually had an idea for generalizing CS-Shapley and semivalues, but let's talk about that in the next meeting.

ok

  • See above, we could create a whole new class of algorithms applicable to classification and regression problems unifying CS-Shapley and Semivalues.

All methods except CWS work with any supervised model. For CWS how would you translate the in-class and out-of-class concepts? With distance? This could use a configurable kernel...

  • For using other samplers in CS-Shapley, we would need to refactor the algorithm. It should be possible by using the combinatorial definition.

What about using permutations of the in-class and out-of-class subsets?
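
To make the "TMC in terms of semivalues with uniform weights" point above concrete for the meeting, here is a minimal sketch in plain Python. It is not pyDVL's implementation: `truncated_permutation_values`, `truncation_tol` and the additive toy utility are hypothetical names chosen for illustration. The idea it shows is that averaging marginal contributions over uniformly sampled permutations already gives Shapley values, and that truncation is just an early exit from the inner loop.

```python
import random
from typing import Callable, FrozenSet, List

Utility = Callable[[FrozenSet[int]], float]


def truncated_permutation_values(
    utility: Utility,
    n: int,
    n_permutations: int,
    truncation_tol: float = 0.0,
    seed: int = 0,
) -> List[float]:
    """Average marginal contributions over uniformly sampled permutations.

    Hypothetical helper, not part of pyDVL. With truncation_tol=0.0 no
    truncation happens and this is plain permutation Monte Carlo Shapley.
    """
    rng = random.Random(seed)
    total = utility(frozenset(range(n)))
    sums = [0.0] * n
    indices = list(range(n))
    for _ in range(n_permutations):
        rng.shuffle(indices)
        prefix = set()
        prev = utility(frozenset())
        for i in indices:
            # Truncation: once the prefix is nearly as good as the full set,
            # the remaining marginal contributions are treated as zero.
            if abs(total - prev) < truncation_tol:
                break
            prefix.add(i)
            curr = utility(frozenset(prefix))
            sums[i] += curr - prev
            prev = curr
    return [s / n_permutations for s in sums]


# Toy additive game: the Shapley value of index i is exactly weights[i].
weights = [1.0, 2.0, 3.0]


def additive_utility(subset: FrozenSet[int]) -> float:
    return sum(weights[i] for i in subset)


values = truncated_permutation_values(additive_utility, n=3, n_permutations=50)
```

With a non-zero `truncation_tol` the estimates become biased towards zero for points that tend to appear late in a permutation, which is the usual trade-off of truncation.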

kosmitive commented 9 months ago

=> We leave TMCS as-is until we have always-on in-memory caching for single-node parallelism. If users can set up a cluster, they can set up memcached.

Sounds good.

All methods except CWS work with any supervised model. For CWS how would you translate the in-class and out-of-class concepts? With distance? This could use a configurable kernel...

By defining a neighborhood for each point i, which could be based on either the label or the features.
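
A rough sketch of what I mean, assuming a configurable kernel and a cut-off. All names here (`split_neighborhood`, `label_kernel`, `gaussian_kernel`, `threshold`) are made up for illustration and none of this exists in pyDVL. Label equality then becomes just one particular kernel, and a distance-based kernel covers the regression case.

```python
import math
from typing import Callable, List, Sequence, Tuple

Kernel = Callable[[Sequence[float], Sequence[float]], float]


def label_kernel(a: Sequence[float], b: Sequence[float]) -> float:
    """Classification case: similarity is 1 for equal labels, 0 otherwise."""
    return 1.0 if tuple(a) == tuple(b) else 0.0


def gaussian_kernel(bandwidth: float = 1.0) -> Kernel:
    """Feature (or target) based similarity, e.g. for regression."""

    def kernel(a: Sequence[float], b: Sequence[float]) -> float:
        squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
        return math.exp(-squared_distance / (2.0 * bandwidth**2))

    return kernel


def split_neighborhood(
    i: int,
    points: Sequence[Sequence[float]],
    kernel: Kernel,
    threshold: float = 0.5,
) -> Tuple[List[int], List[int]]:
    """Split all indices except i into (inside, outside) the neighborhood of i."""
    inside: List[int] = []
    outside: List[int] = []
    for j, point in enumerate(points):
        if j == i:
            continue  # the point itself is left out of both sets in this sketch
        if kernel(points[i], point) >= threshold:
            inside.append(j)
        else:
            outside.append(j)
    return inside, outside


# Labels as 1-tuples: recovers the usual in-class / out-of-class split.
labels = [(0,), (0,), (1,), (1,)]
by_label = split_neighborhood(0, labels, label_kernel)  # -> ([1], [2, 3])

# Features: nearby points play the role of the "in-class" set.
features = [(0.0,), (0.1,), (5.0,)]
by_feature = split_neighborhood(0, features, gaussian_kernel(1.0))  # -> ([1], [2])
```

The threshold is the weak spot of this design: a soft weighting by kernel value instead of a hard split might be the better generalization, but that changes the algorithm more.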

What about using permutations of the in-class and out-of-class subsets?

The code needs to be adapted, but it should be possible. I still have to verify it, especially for the out-of-class sets, to be 100% sure.
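
For reference, this is how I read the permutation proposal, as an unverified sketch. The function name, the two-argument utility and the way the out-of-class subset is drawn (a random-length prefix of a random permutation) are all assumptions for discussion, not the current pyDVL code and not necessarily what the CS-Shapley paper prescribes.

```python
import random
from typing import Callable, Dict, FrozenSet, Sequence

# Hypothetical utility: scores an in-class subset given a fixed out-of-class subset.
ConditionalUtility = Callable[[FrozenSet[int], FrozenSet[int]], float]


def class_wise_permutation_values(
    utility: ConditionalUtility,
    in_class: Sequence[int],
    out_of_class: Sequence[int],
    n_outer: int,
    n_inner: int,
    seed: int = 0,
) -> Dict[int, float]:
    """Values for the in-class points of one label, using two nested permutations."""
    rng = random.Random(seed)
    sums = {i: 0.0 for i in in_class}
    out_perm = list(out_of_class)
    in_perm = list(in_class)
    for _ in range(n_outer):
        # Outer step: a prefix of a random permutation of the out-of-class
        # points serves as the conditioning subset (assumption of this sketch).
        rng.shuffle(out_perm)
        size = rng.randint(0, len(out_perm))
        out_subset = frozenset(out_perm[:size])
        for _ in range(n_inner):
            # Inner step: ordinary permutation sampling over the in-class points.
            rng.shuffle(in_perm)
            prefix = set()
            prev = utility(frozenset(), out_subset)
            for i in in_perm:
                prefix.add(i)
                curr = utility(frozenset(prefix), out_subset)
                sums[i] += curr - prev
                prev = curr
    return {i: total / (n_outer * n_inner) for i, total in sums.items()}


# Toy utility, additive in the in-class points and shifted by the out-of-class size.
point_weights = {0: 1.0, 1: 2.0, 2: 3.0}


def toy_utility(in_subset: FrozenSet[int], out_subset: FrozenSet[int]) -> float:
    return sum(point_weights[i] for i in in_subset) + 0.5 * len(out_subset)


cw_values = class_wise_permutation_values(
    toy_utility, in_class=[0, 1, 2], out_of_class=[3, 4], n_outer=4, n_inner=5
)
```

The part I still need to check is whether drawing the out-of-class subset this way gives the same distribution over subsets as the current sampling.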