aai-institute / pyDVL

pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
https://pydvl.org
GNU Lesser General Public License v3.0

Feature/filter converged #437

Closed · mdbenito closed this 9 months ago

mdbenito commented 9 months ago

Description

This PR closes #303

Changes

Unrelated:

Checklist

kosmitive commented 9 months ago

Can you elaborate on why this parameter is necessary? Instead of doing that, we could also try to exclude marginal evaluations for converged indices by removing them from the task queue.

mdbenito commented 9 months ago

> Can you elaborate on why this parameter is necessary? Instead of doing that, we could also try to exclude marginal evaluations for converged indices by removing them from the task queue.

Yes, sorry. That is exactly what is done. The parameter is necessary because simply making skipping the default would change the behaviour for existing users. Also, values very often end up converging only thanks to the "unexpected" additional marginal computations they receive while the global criterion is still unmet.
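
As a purely illustrative sketch of this mechanism (this is not pyDVL code; the names below, e.g. `toy_montecarlo` and `marginal`, are hypothetical):

```python
import numpy as np


def toy_montecarlo(marginal, n_indices, stderr_threshold, max_updates,
                   skip_converged=False, seed=0):
    """Toy Monte Carlo estimator illustrating the skip_converged idea."""
    rng = np.random.default_rng(seed)
    sums = np.zeros(n_indices)
    sq_sums = np.zeros(n_indices)
    counts = np.zeros(n_indices, dtype=int)

    def converged(i):
        # An index counts as converged once its standard error drops below
        # the threshold or it has reached the maximum number of updates.
        if counts[i] < 2:
            return False
        if counts[i] >= max_updates:
            return True
        mean = sums[i] / counts[i]
        var = max(sq_sums[i] / counts[i] - mean ** 2, 0.0)
        return np.sqrt(var / counts[i]) < stderr_threshold

    # The loop stops only when *all* indices are converged (a global check,
    # analogous to done(result)). Without skip_converged, already-converged
    # indices keep receiving marginal evaluations until that point.
    while not all(converged(i) for i in range(n_indices)):
        for i in rng.permutation(n_indices):
            if skip_converged and converged(i):
                continue  # save the cost of further marginal evaluations
            x = marginal(i, rng)
            sums[i] += x
            sq_sums[i] += x ** 2
            counts[i] += 1

    return sums / counts, counts


# Example: noisy marginals with per-index means 0..4
# values, counts = toy_montecarlo(lambda i, rng: rng.normal(i, 1.0),
#                                 n_indices=5, stderr_threshold=0.05,
#                                 max_updates=1000, skip_converged=True)
```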

This is something that happens often: we use `AbsoluteStandardError(0.02) | MaxUpdates(1000)`. Some values might fulfil the first criterion very quickly despite being very bad estimates. Because the other indices take much longer to reach a low standard error and `done(result)` is a global check, the "converged" ones keep being updated and end up being good estimates. If we stopped updating them, their values would be off by a lot. The way to fix this is to make the stderr threshold much lower, e.g. `AbsoluteStandardError(1e-3) | MaxUpdates(1000)`. With `skip_converged=True` this more stringent check still takes less time than the first one.
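
For reference, the combination described above might be used roughly like this (a minimal sketch; the model, dataset and scorer are arbitrary placeholders, and the exact signature of `permutation_montecarlo_shapley`, including the `skip_converged` flag introduced by this PR, may differ between pyDVL versions):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

from pydvl.utils import Dataset, Utility
from pydvl.value.shapley.montecarlo import permutation_montecarlo_shapley
from pydvl.value.stopping import AbsoluteStandardError, MaxUpdates

# Toy utility: any model / dataset / scorer combination works here.
data = Dataset.from_sklearn(load_iris(), train_size=0.7)
u = Utility(LogisticRegression(max_iter=500), data, "accuracy")

# A much tighter stderr threshold than 0.02, plus a hard cap on updates.
# The tighter check is affordable because indices that already satisfy it
# receive no further marginal evaluations when skip_converged is enabled.
done = AbsoluteStandardError(1e-3) | MaxUpdates(1000)

values = permutation_montecarlo_shapley(
    u,
    done=done,
    skip_converged=True,  # the flag introduced in this PR
)
```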

mdbenito commented 9 months ago

> This is something that happens often: we use `AbsoluteStandardError(0.02) | MaxUpdates(1000)`. Some values might fulfil the first criterion very quickly despite being very bad estimates. Because the other indices take much longer to reach a low standard error and `done(result)` is a global check, the "converged" ones keep being updated and end up being good estimates. If we stopped updating them, their values would be off by a lot. The way to fix this is to make the stderr threshold much lower, e.g. `AbsoluteStandardError(1e-3) | MaxUpdates(1000)`. With `skip_converged=True` this more stringent check still takes less time than the first one.

Added something along these lines to the docs for `stopping.py`.