Parallelized score finalization?

nickk124 commented 1 year ago

Hi,

First of all, thank you for an awesome library/paper and for the high quality documentation and support!

I'm using a very large attribution training set (>1M examples) and test set (>50K examples), so I followed your tutorial at https://trak.readthedocs.io/en/latest/slurm.html to parallelize the featurization. Now, at the score finalization step, I'm getting an OOM error because computing this product requires >200 GB of GPU memory:

    scores = traker.finalize_scores(exp_name=trak_exp_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../site-packages/trak/traker.py", line 471, in finalize_scores
    _scores[:] += self.score_computer.get_scores(g, g_target).cpu().clone().detach().numpy()
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../site-packages/trak/score_computers.py", line 97, in get_scores
    return features @ target_grads.T

I will likely try to split this computation into "blocks" that can be computed separately to make it tractable, so I figured I'd raise an issue for it here too. By the way, I'm using 20 models.

EDIT: it looks like this is similar to the PR here https://github.com/MadryLab/trak/pull/43. I'm going to attempt to resolve it by lowering CUDA_MAX_DIM_SIZE in trak.score_computers.BasicScoreComputer.

Thanks!

nickk124 commented 1 year ago

EDIT 2: after that modification, the OOM error now occurs as soon as the empty score tensor is initialized on the GPU. I'll try it on the CPU, but otherwise I think this is just a limitation of my dataset being so large. Marking as resolved :)
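
For scale, here is a rough back-of-the-envelope estimate of the full score matrix size. It assumes float32 scores (4 bytes per element) and takes the set sizes as exactly 1M and 50K, which are only lower bounds here; the helper name score_matrix_bytes is made up for illustration and is not part of TRAK.

```python
def score_matrix_bytes(n_train, n_targets, bytes_per_element=4):
    """Size in bytes of a dense (n_train x n_targets) score matrix.

    bytes_per_element=4 assumes float32; this is an assumption,
    not something read from the TRAK source.
    """
    return n_train * n_targets * bytes_per_element


# Lower-bound sizes from this issue: >1M train examples, >50K targets.
full = score_matrix_bytes(1_000_000, 50_000)
print(f"{full / 1e9:.0f} GB")  # prints "200 GB"
```

This lines up with the >200 GB figure in the error: the dense score matrix alone is about that large, regardless of how the product is computed.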

kristian-georgiev commented 1 year ago

Hey @nickk124. Thanks for opening this issue. We are in fact in the process of fixing this! Check out PR #43 (it's still a bit rough around the edges; it'll be merged in the next version of TRAK in the coming weeks). Essentially, we won't initialize the scores array on GPU and instead only compute blocks of it (on GPU). I'll close this issue once we merge.
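
The block-wise idea described above can be sketched roughly as follows. This is a minimal illustration in NumPy, not the code from PR #43 or from TRAK itself: the function name blockwise_scores and the block_size parameter are hypothetical, and the host-to-GPU transfer of each block is left out so the example runs anywhere.

```python
import numpy as np


def blockwise_scores(features, target_grads, block_size=4096):
    """Compute features @ target_grads.T one row block at a time.

    The full output lives in host memory; only one
    (block_size x n_targets) slice of the product is computed per step.
    In a GPU setting, that slice is what would be computed on the device
    and then copied back to the host array.
    """
    n_train = features.shape[0]
    n_targets = target_grads.shape[0]
    scores = np.zeros((n_train, n_targets), dtype=np.float32)
    for start in range(0, n_train, block_size):
        stop = min(start + block_size, n_train)
        # Only this block of the product is materialized at once.
        scores[start:stop] = features[start:stop] @ target_grads.T
    return scores
```

With block_size chosen so that a single block fits in GPU memory, peak device usage depends on the block size rather than on the full training set size. The host array still has to hold the whole score matrix, so for very large datasets it may need to be backed by disk (for example with np.memmap).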

nickk124 commented 1 year ago

@kristian-georgiev, thanks, great to hear!

kristian-georgiev commented 10 months ago

Resolved by https://github.com/MadryLab/trak/commit/16e9d4627c41292a4b81a0d28962dbc42803239c. Currently in the v0.3.0 branch.

MadryLab / trak

Parallelized score finalization? #45