MadryLab / trak

A fast, effective data attribution method for neural networks in PyTorch
https://trak.csail.mit.edu/
MIT License

Support dataset sharding #29

Closed scoutsaachi closed 1 year ago

scoutsaachi commented 1 year ago

This line https://github.com/MadryLab/trak/blob/main/trak/traker.py#L268 should not zero out what is already in the target store. Change the file mode from 'w+' to 'r+'.

Also, we should add a naming system so that you can score different sets of targets.
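For context, here is a minimal sketch of the mode difference using `np.memmap` (the actual trak store code may differ; the file name and shapes here are made up for illustration): `'w+'` creates or truncates the file, so re-opening an existing store zeroes it, while `'r+'` opens it in place and keeps earlier writes.

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "grad_store.mmap")

# 'w+' creates (or truncates) the file, so the store starts zero-filled.
store = np.memmap(path, dtype=np.float32, mode="w+", shape=(4,))
store[:2] = [1.0, 2.0]  # a first shard writes its slice
store.flush()
del store

# Re-opening with 'w+' wipes the earlier writes...
wiped = np.memmap(path, dtype=np.float32, mode="w+", shape=(4,))
print(wiped.tolist())  # [0.0, 0.0, 0.0, 0.0]
del wiped

# ...whereas 'r+' opens the existing file and preserves its contents.
store = np.memmap(path, dtype=np.float32, mode="w+", shape=(4,))
store[:2] = [1.0, 2.0]
store.flush()
del store
kept = np.memmap(path, dtype=np.float32, mode="r+", shape=(4,))
print(kept.tolist())  # [1.0, 2.0, 0.0, 0.0]
```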

kristian-georgiev commented 1 year ago

The simplest solution would be to require users to specify experiment name (exp_name) in both start_scoring_checkpoint and finalize_scores. This way, target (gradient) arrays will be tied to an experiment name, and we don't need to rely on w+ for cleaning up between scoring different targets.

Unfortunately, this introduces a (minor) backward incompatibility --- running start_scoring_checkpoint without exp_name will error out.
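A toy illustration of the proposal (class and method names here are hypothetical, not trak's actual API): keying target-gradient buffers by `exp_name` means each scoring run gets its own array, and later calls reuse the buffer instead of relying on `'w+'` to zero it out.

```python
import numpy as np


class ScoringStore:
    """Hypothetical sketch: tie target-gradient buffers to an experiment
    name so that scoring different target sets never collides."""

    def __init__(self):
        self._target_grads = {}

    def start_scoring_checkpoint(self, exp_name, num_targets, proj_dim):
        # Allocate per-experiment on first use; later calls reuse the
        # buffer rather than zeroing it (no cleanup via 'w+' needed).
        if exp_name not in self._target_grads:
            self._target_grads[exp_name] = np.zeros(
                (num_targets, proj_dim), dtype=np.float32
            )
        return self._target_grads[exp_name]

    def finalize_scores(self, exp_name):
        # Requiring exp_name here is the (minor) breaking change.
        if exp_name not in self._target_grads:
            raise KeyError(f"no scoring run named {exp_name!r}")
        return self._target_grads[exp_name]


store = ScoringStore()
val_grads = store.start_scoring_checkpoint("val_set", num_targets=2, proj_dim=3)
val_grads[:] = 1.0
store.start_scoring_checkpoint("test_set", num_targets=2, proj_dim=3)

# The two experiments stay independent.
print(store.finalize_scores("val_set").tolist())   # [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
print(store.finalize_scores("test_set").tolist())  # [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
```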

kristian-georgiev commented 1 year ago

To fully support dataset sharding, my guess is that we'll need to make minor changes to the train gradient stores as well. Can you provide a minimal test with the desired functionality (computing TRAK scores over different parts of the dataset in parallel)? Thanks!

scoutsaachi commented 1 year ago

honestly for now I think just swapping w+ to r+ is enough. It is weird/unexpected that the behavior is different between featurize (where we do not overwrite) and score (where we do).

More to the point, the inds argument is useless if you are going to overwrite on every call.

(I will create a test, but this seems like a pretty simple fix just for the basics)
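To sketch why `'r+'` makes `inds` useful again (hypothetical store layout, mirroring the `inds` argument mentioned above, not trak's actual code): with `'r+'`, each shard re-opens the same store and fills only its own index slice, so parallel shards compose instead of each call wiping the previous one.

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "scores.mmap")
n = 4

# Create the store once, zero-filled.
np.memmap(path, dtype=np.float32, mode="w+", shape=(n,)).flush()

# Each shard re-opens with 'r+' and writes only at its own inds;
# with 'w+' every iteration would have erased the previous shard.
for inds in ([0, 1], [2, 3]):
    shard = np.memmap(path, dtype=np.float32, mode="r+", shape=(n,))
    shard[inds] = [float(i + 1) for i in inds]
    shard.flush()
    del shard

final = np.memmap(path, dtype=np.float32, mode="r", shape=(n,))
print(final.tolist())  # [1.0, 2.0, 3.0, 4.0]
```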

kristian-georgiev commented 1 year ago

https://github.com/MadryLab/trak/commit/f0e12f2d4ee4aec6bc16cdea97d70fba4db5b182

kristian-georgiev commented 1 year ago

Resolved by https://github.com/MadryLab/trak/commit/03bc0eccefec4b47b3e323d1c0c330dc4c54e5de