aai-institute / pyDVL

pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
https://pydvl.org
GNU Lesser General Public License v3.0

Issue with SemiValue batching and parallelization #490

Open AnesBenmerzoug opened 5 months ago

AnesBenmerzoug commented 5 months ago

While working on PR #341, I realized that there is a bug in the batching feature of semivalues when using n_jobs > 1. The results are almost the same but not exactly the same.

AnesBenmerzoug commented 5 months ago

@kosmitive could you have a look at this if you have time?

kosmitive commented 5 months ago

Yes, it might be related to the parallelization: with parallel processing, the order in which results arrive is subject to a race condition. I recall this occurs only for semivalues, since we break the calculation down into single marginals. Or do you think it might be a different problem?
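To illustrate why arrival order alone could give "almost the same but not exactly the same" numbers, here is a minimal sketch. It assumes the discrepancy comes from floating-point accumulation order, which is not confirmed in this thread; `accumulate` is a hypothetical helper, not pyDVL code.

```python
def accumulate(marginals):
    """Add marginals one by one, in the order they arrive."""
    total = 0.0
    for m in marginals:
        total += m
    return total


# The same three values, arriving in two different orders.
forward = accumulate([0.1, 0.2, 0.3])
backward = accumulate([0.3, 0.2, 0.1])

# Floating-point addition is not associative, so the totals can differ
# in the last bits even though the inputs are identical.
print(forward == backward)
print(abs(forward - backward))
```

If this is the cause, comparing results with a tolerance (for example `math.isclose`) would hide the difference, while exact equality would not.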

For the described problem, we could introduce an order resolver on the main thread, but at the cost of extra RAM: on average about N/2 * C, where N is the number of processes and C is the cost per process.
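A rough sketch of what such an order resolver could look like, assuming each batch carries the index it was submitted with. The `OrderResolver` class and its `push` method are hypothetical names for illustration only, not an existing pyDVL API.

```python
class OrderResolver:
    """Buffer results that arrive early and release them in submission order."""

    def __init__(self):
        self._next = 0      # index of the next result allowed to be released
        self._pending = {}  # early arrivals, keyed by submission index

    def push(self, index, result):
        """Store one result and return every result that is now ready, in order."""
        self._pending[index] = result
        ready = []
        while self._next in self._pending:
            ready.append(self._pending.pop(self._next))
            self._next += 1
        return ready


resolver = OrderResolver()
# Simulated out-of-order arrival of four batches from worker processes.
arrival = [(2, "c"), (0, "a"), (1, "b"), (3, "d")]
released = []
for index, result in arrival:
    released.extend(resolver.push(index, result))

print(released)  # ['a', 'b', 'c', 'd']
```

The `_pending` buffer is where the extra memory goes: results that finish early have to be held on the main thread until all earlier batches have arrived.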

kosmitive commented 5 months ago

@AnesBenmerzoug Here is the reference that was made in the tests: https://github.com/aai-institute/pyDVL/blob/7c003beeed00416f6f03dd9b3cd4be7a20339d25/tests/value/test_semivalues.py#L228-L229. Do we want to go for an order resolution object for the batches?

AnesBenmerzoug commented 5 months ago

@kosmitive I changed that test to use a deterministic scoring method coming from a toy game, so the order of batches shouldn't have an effect on the final result.

schroedk commented 2 weeks ago

Potentially resolved by #558