aai-institute / pyDVL

pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
https://pydvl.org
GNU Lesser General Public License v3.0

Refactor MSR Banzhaf valuation #605

Closed janosg closed 4 months ago

janosg commented 5 months ago

Description

This PR implements Maximum Sample Re-use (MSR) Banzhaf valuation in the new architecture. The implementation deviates strongly from the previous implementation and fixes a bug in the variance estimation.

The new implementation uses two ValuationResult instances to track the positive and negative running means separately. After each update, these are combined into the final result object: the update counter of the combined result is set to the minimum of the two update counters, and its variance is set to the sum of the two variances (assuming independence).
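As a rough illustration of the combination rule described above, here is a minimal NumPy sketch. It does not use pyDVL's API: `RunningStats`, `msr_update` and `combine` are hypothetical names, and the sketch assumes that the "positive" mean for an index averages the utilities of sampled subsets containing it, while the "negative" mean averages those of subsets not containing it.

```python
import numpy as np


class RunningStats:
    """Hypothetical per-index running mean and variance (Welford's algorithm)."""

    def __init__(self, n: int):
        self.counts = np.zeros(n, dtype=int)
        self.means = np.zeros(n)
        self._m2 = np.zeros(n)  # running sum of squared deviations

    def update(self, idx: int, x: float) -> None:
        self.counts[idx] += 1
        delta = x - self.means[idx]
        self.means[idx] += delta / self.counts[idx]
        self._m2[idx] += delta * (x - self.means[idx])

    @property
    def variances(self) -> np.ndarray:
        # Population variance of the samples seen so far; 0 where no updates.
        return self._m2 / np.maximum(self.counts, 1)


def msr_update(pos: RunningStats, neg: RunningStats, subset_mask, utility: float) -> None:
    """Re-use one utility evaluation to update every index."""
    for i, in_subset in enumerate(subset_mask):
        (pos if in_subset else neg).update(i, utility)


def combine(pos: RunningStats, neg: RunningStats):
    """Combine both trackers following the rule in the PR description."""
    values = pos.means - neg.means
    variances = pos.variances + neg.variances  # assumes independence
    counts = np.minimum(pos.counts, neg.counts)
    return values, variances, counts


pos, neg = RunningStats(2), RunningStats(2)
msr_update(pos, neg, [True, False], 1.0)
msr_update(pos, neg, [False, True], 3.0)
msr_update(pos, neg, [True, True], 5.0)
values, variances, counts = combine(pos, neg)
```

Taking the minimum of the two counters is conservative: an index whose negative mean has only one sample is reported as having one update, however often the positive mean was updated.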

Open questions

Checklist

janosg commented 5 months ago

I clarified the documentation of ValuationResult.variances, but I still think it is a bit misleading that ValuationResult.variances is not just the square of ValuationResult.stderr.

I think it would be clearer if we only exposed the square root of the variances as ValuationResult.stdev; then it would be clear that the difference between stdev and stderr must be a conceptual one. Also, standard deviations are usually more interpretable than variances.
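To make the distinction concrete, here is a small sketch using plain NumPy rather than ValuationResult. It assumes that stderr means the standard error of the mean, i.e. the square root of the sample variance divided by the number of updates. That relationship is an assumption for illustration, not something confirmed in this thread.

```python
import numpy as np

# Hypothetical utility samples for a single data point.
samples = np.array([1.0, 2.0, 3.0, 4.0])

variance = samples.var()                  # spread of the samples
stdev = np.sqrt(variance)                 # same information, in the units of the values
stderr = stdev / np.sqrt(len(samples))    # uncertainty of the estimated mean

# Under this assumption, stderr**2 equals variance / count, not variance.
print(variance, stdev, stderr, stderr**2)
```

Under this reading, stdev and stderr differ by a factor of the square root of the count, so squaring stderr does not recover the variance.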