MadryLab / trak

A fast, effective data attribution method for neural networks in PyTorch
https://trak.csail.mit.edu/
MIT License

CIFAR-10 Trak Scores Reproduction #22

Closed vasusingla closed 1 year ago

vasusingla commented 1 year ago

Hi,

Thanks for the amazing work, and for releasing your code.

I was trying to reproduce your results for CIFAR-10 using your quickstart notebook. However, the scores do not seem semantically meaningful. I tried visualizing the closest train images to a test image; I'm attaching the results below. The first image is the test image, and the next 10 are the closest train samples.

TRAK -

[image: test image and 10 closest train samples under TRAK]

DataModels -

[screenshot: test image and 10 closest train samples under datamodels]
kristian-georgiev commented 1 year ago

Thank you for your question! This is, in fact, expected: the goal of the quickstart notebook is to provide a lightweight end-to-end example of the API, not to directly reproduce any of the results in the paper. In particular, the quickstart notebook uses different hyperparameters from the paper (3 checkpoints, projection dimension 2048). In contrast, in our CIFAR-10 experiments in the paper, we use a projection dimension of 20,000 and start matching the counterfactual performance of certain datamodels only after using >10 checkpoints (see Figure 1 and Table A.1 in our paper for more details).
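For intuition, the projection dimension sets the output size of the random (Johnson-Lindenstrauss-style) projection TRAK applies to per-example gradients. Here is a minimal NumPy sketch of a Rademacher projection; this is illustrative only (the library uses a fast CUDA projector, and the dimensions below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 4096        # hypothetical number of model parameters
proj_dim = 512  # projection dimension (TRAK requires a multiple of 512)

# Rademacher projection matrix: entries are +1/-1 with equal probability,
# scaled so that inner products are preserved in expectation.
P = rng.choice([-1.0, 1.0], size=(d, proj_dim)) / np.sqrt(proj_dim)

grads = rng.standard_normal((8, d))  # 8 per-example gradient vectors
projected = grads @ P                # shape (8, proj_dim)

print(projected.shape)
```

A larger `proj_dim` preserves gradient inner products more faithfully, which is why the paper's experiments use 20,000 while the quickstart trades accuracy for speed with 2048.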

That being said, thank you for pointing out this discrepancy; I agree it can be quite confusing to see bad qualitative results from the quickstart. We'll update the notebook to use (slightly) heavier hyperparameters so that it computes useful scores; I'll link the commit introducing this change here.

psandovalsegura commented 1 year ago

Hi Kristian, thanks for open-sourcing TRAK. It's an interesting method and well-written paper.

I am also working on reproducing results using the cifar quickstart notebook. I've read your previous response and tried to use a projection dimension of 20k, but got the following error from trak/projectors.py:

```
ValueError: Invalid Number of JL dimensions it has to be a multiple of 512
```

so I opted to use 20480 dimensions for the TRAKer instead. Next, using 20 ResNet-9 checkpoints, I ran the remainder of the cifar quickstart notebook and these are the results for the closest train samples (bottom row) to the test image at index 40 (top):

[screenshot: test image (top) and 10 closest train samples (bottom)]
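(As an aside, 20480 is just 20000 rounded up to the next multiple of 512:)

```python
import math

target = 20000
proj_dim = math.ceil(target / 512) * 512
print(proj_dim)  # 20480
```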

It still appears that the scores are not semantically meaningful. Not sure if this helps debug, but the highest TRAK score for this test image is 0.003849233 (corresponding to the first train image in the bottom row). I'm not quite sure where to look to fix this issue, so I've written out some questions:

  1. Should I use a proj dim of 20480 for reproducing results from the paper?
  2. Currently TRAKer.save_dir for the experiment above (using 20 checkpoints) takes up 120GB. Is this expected?
  3. Is there any plan to release the (50000, 10000) scores matrix for TRAK_20 and TRAK_100?
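Related to question 3: given such a `(num_train, num_test)` scores matrix, the closest train samples for a test image can be pulled in a couple of lines. (The variable names and shapes below are hypothetical; a random matrix stands in for real scores.)

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a (num_train, num_test) TRAK scores matrix.
scores = rng.standard_normal((50_000, 100))

test_idx = 40
k = 10
# Indices of the k train samples with the highest score for this test image.
top_train = np.argsort(scores[:, test_idx])[::-1][:k]
print(top_train.shape)
```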

Thanks for your help!

xszheng2020 commented 1 year ago

Same issue here: the TRAK scores computed with 25 checkpoints and a 20480 projection dimension are not good. (But I used the Normal projector instead, because the Rademacher projector has some problems on my machine, see #24.)

BTW, in cifar_quickstart.ipynb we do not subsample the training set; is that because there is no hyperparameter \alpha?

kristian-georgiev commented 1 year ago

Thank you @vasusingla @psandovalsegura @xszheng2020 for identifying the issue. It turned out that there was a small bug in the TRAKer class causing a mismatch between the checkpoints used for featurizing and scoring; it's fixed in https://github.com/MadryLab/trak/pull/28. I've (slightly) updated the quickstart notebook to get visually good results with a minimal amount of compute. I'm also going to add links to masks & margins for computing the LDS for CIFAR-10 in the quickstart, as well as a Colab version of the quickstart, in v0.1.3.

Re @psandovalsegura:

  1. Should I use a proj dim of 20480 for reproducing results from the paper?
    • to exactly reproduce our results, yes. However, we have empirically observed that using e.g. 4096 gets you most of the way there in terms of correlation (LDS).
  2. Currently TRAKer.save_dir for the experiment above (using 20 checkpoints) takes up 120GB. Is this expected?
    • this is expected when using a large projection dimension. You can use the del_grads argument to delete intermediate results (model gradients) if you are space-constrained.
  3. Is there any plan to release the (50000, 10000) scores matrix for TRAK_20 and TRAK_100?
    • adding this in v0.1.3; ETA is ~4/17/2023.
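For a rough sanity check on question 2, here is a back-of-envelope estimate of the projected-gradient store alone (assuming fp32 storage; the actual on-disk layout may differ, and test gradients plus other intermediates account for additional space):

```python
num_train = 50_000
proj_dim = 20_480
num_ckpts = 20
bytes_per_float = 4  # fp32; halve this if gradients are stored in fp16

grad_store_gb = num_train * proj_dim * num_ckpts * bytes_per_float / 1e9
print(f"{grad_store_gb:.1f} GB")  # ~82 GB for the train gradients alone
```

With the remaining buffers included, a total on the order of 120GB is plausible.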

Re @xszheng2020: In cifar_quickstart.ipynb, we do not subsample the training set since there is no hyperparameter \alpha?

kristian-georgiev commented 1 year ago

@psandovalsegura added a colab with pre-computed TRAK (and a few baselines: datamodels, influences) scores: https://colab.research.google.com/drive/1Mlpzno97qpI3UC1jpOATXEHPD-lzn9Wg?usp=sharing

Note: I chose to upload scores for 200 randomly selected test samples to keep things under 50MB, so the matrices are (50000, 200), not (50000, 10000).
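(Sanity check on the size: a dense fp32 matrix of that shape is comfortably under 50MB:)

```python
num_train, num_test = 50_000, 200
size_mb = num_train * num_test * 4 / 1e6  # 4 bytes per fp32 entry
print(f"{size_mb:.0f} MB")  # 40 MB per scores matrix
```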