aai-institute / pyDVL

pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
https://pydvl.org
GNU Lesser General Public License v3.0

Implement two-level caching on utility (level 1) and subset of indices (level 2). #475

Open kosmitive opened 6 months ago

kosmitive commented 6 months ago

Running a benchmark with multiple experiments results in similar calls to the utility object. Between experiments, the user has to decide whether the cache needs to be emptied.

Even with the same samples, CWS can't use the same cache as TMC. Abstractly speaking, there may be multiple experiments E1, E2, E3. Assume the transition from E1 to E2 invalidates the cache, but E3 can reuse the cache. Requiring the user to know, algorithm by algorithm, when a cache must be invalidated makes these features inaccessible to anyone without that knowledge. It would be preferable to have multiple experiment caches. These high-level caches might use an LRU(10) policy, while their sub-objects might use an LRU(10000) policy.
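To make the idea concrete, here is a minimal sketch of such a two-level cache in plain Python. It is not pyDVL's existing caching code: the names `LRUCache`, `TwoLevelCache` and `get_or_compute` are hypothetical, and the experiment key is assumed to be any hashable value that identifies a utility.

```python
from collections import OrderedDict
from typing import Callable, FrozenSet, Hashable, Iterable


class LRUCache:
    """Minimal LRU mapping (hypothetical helper, not part of pyDVL)."""

    def __init__(self, maxsize: int):
        self.maxsize = maxsize
        self._data: OrderedDict = OrderedDict()

    def __contains__(self, key: Hashable) -> bool:
        return key in self._data

    def get(self, key: Hashable):
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: Hashable, value) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict the least recently used entry


class TwoLevelCache:
    """Level 1: one entry per utility. Level 2: one entry per subset of indices."""

    def __init__(self, n_experiments: int = 10, n_subsets: int = 10_000):
        self._experiments = LRUCache(n_experiments)
        self._n_subsets = n_subsets

    def get_or_compute(
        self,
        utility_key: Hashable,
        indices: Iterable[int],
        compute: Callable[[FrozenSet[int]], float],
    ) -> float:
        # Level 1: find or create the cache that belongs to this utility.
        if utility_key not in self._experiments:
            self._experiments.put(utility_key, LRUCache(self._n_subsets))
        subsets = self._experiments.get(utility_key)
        # Level 2: subsets are order-independent, so use a frozenset as key.
        key = frozenset(indices)
        if key not in subsets:
            subsets.put(key, compute(key))
        return subsets.get(key)
```

With this layout, switching from one experiment to another never requires clearing anything by hand: an experiment with a different utility simply lands in a different level-1 entry, and old entries fall out through the LRU policy.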

Whether the cache of a historic experiment EH can be reused can be verified with a signature of the utility object, since the value of the utility depends on the model, dataset and scorer. In the current state it also depends on the valuation method (CWS imposes a modified scorer), but this would be caught by checking the equivalence of the scorer in the signature.
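One possible way to build such a signature is sketched below. It assumes the model can be described by a name plus JSON-serializable parameters, the dataset by its raw bytes, and the scorer by a name; the function `utility_signature` and all of its parameters are hypothetical and do not correspond to an existing pyDVL API.

```python
import hashlib
import json


def utility_signature(
    model_name: str, model_params: dict, data: bytes, scorer_name: str
) -> str:
    """Hash everything the utility value depends on into one hex digest."""
    parts = [
        model_name.encode(),
        # sort_keys makes the digest independent of parameter order
        json.dumps(model_params, sort_keys=True).encode(),
        data,
        scorer_name.encode(),
    ]
    h = hashlib.sha256()
    for part in parts:
        # A length prefix keeps adjacent parts from running into each other.
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()


# Same model and data, but a modified scorer: the signatures differ,
# so the two experiments would not share a cache.
plain = utility_signature("logreg", {"C": 1.0}, b"\x00\x01", "accuracy")
modified = utility_signature("logreg", {"C": 1.0}, b"\x00\x01", "modified_accuracy")
```

Two experiments would share a cache exactly when their signatures are equal, so a method that swaps in a different scorer is separated automatically, provided the scorer's name (or another stable description of it) really changes.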

So my proposal would be to add: