Closed bryant1410 closed 7 months ago
Hi @bryant1410, yes, the weights of the encoder are indeed frozen during evaluation, but data augmentations are still used when training the probe on top of the frozen backbone. This is the common practice for probing vision models (e.g., the popular IN1k linear probe).
You could also just create a "dataset" by going through the training set several times and computing the embeddings of various augmentations of each training video. This is probably closest to your suggestion, and it would be valuable for the efficiency of training different types of probes or exploring hyper-parameters. It's not a priority right now, but certainly something we can look at adding in the future!
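A minimal sketch of that idea (all names here — `frozen_encoder`, `augment`, the toy data — are stand-ins, not the repo's actual API): pre-compute the embeddings of K augmented views per sample once, then train a linear probe on the cached bank, never touching the backbone again.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen backbone: a fixed random projection (no gradients).
W_enc = rng.normal(size=(32, 8))
def frozen_encoder(x):
    return x @ W_enc

def augment(x):
    # Toy augmentation: additive jitter (real pipelines would crop/flip, etc.).
    return x + 0.1 * rng.normal(size=x.shape)

# Toy dataset: two Gaussian classes.
X = np.concatenate([rng.normal(-1, 1, (50, 32)), rng.normal(1, 1, (50, 32))])
y = np.array([0] * 50 + [1] * 50)

# "Go through the dataset several times": cache K augmented views per sample.
K = 4
bank_X = np.concatenate([frozen_encoder(augment(X)) for _ in range(K)])
bank_y = np.tile(y, K)

# Train a logistic-regression probe on the cached embeddings (plain GD).
w = np.zeros(8)
b = 0.0
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(bank_X @ w + b)))
    grad = p - bank_y
    w -= 0.1 * bank_X.T @ grad / len(bank_y)
    b -= 0.1 * grad.mean()

# Evaluate the probe on clean (un-augmented) embeddings.
acc = ((frozen_encoder(X) @ w + b > 0) == (y == 1)).mean()
print(f"probe accuracy on clean embeddings: {acc:.2f}")
```

The point of the cache is that the expensive encoder forward pass happens once per augmented view, after which probe training (or hyper-parameter sweeps over many probes) only touches the small embedding bank.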
Oh, I didn't know it was common practice for some (e.g., LG-SSL doesn't do this).
Feel free to close this issue or leave it open.
Shouldn't the frozen evaluation avoid augmentations?
After looking at the code and reading the paper, I see that you apply random augmentations when computing the embeddings. The weights of the encoder are frozen, but the evaluation is not: a given video gets different embeddings in different epochs. This is a bit misleading, because the results wouldn't carry over if I pre-extracted the embeddings once (i.e., if they were truly frozen).
I think it'd be nice to see the performance of your models in such a setting. It's just a suggestion of something that I, and I believe others, would find useful, but I understand if you can't do it for some reason.
(I assume the baselines apply the same augmentations, so my concern still applies.)