facebookresearch / jepa

PyTorch code and models for V-JEPA self-supervised learning from video.

Augmentations in the frozen evaluation #24

Closed bryant1410 closed 7 months ago

bryant1410 commented 7 months ago

Shouldn't the frozen evaluation avoid augmentations?

After looking at the code and reading the paper, I see that you apply random augmentations when computing the embeddings. The weights of the encoder are frozen, but the evaluation as a whole is not: since the augmentations are random, the same video gets different embeddings in different epochs. This is a bit misleading, because these results wouldn't carry over if I pre-extracted the embeddings (i.e., if they were truly frozen).
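To illustrate what I mean, here is a minimal sketch (the encoder and augmentation here are hypothetical stand-ins, not this repo's API):

```python
import torch
from torchvision import transforms

# Hypothetical stand-ins (not this repo's API): a tiny frozen "encoder"
# and the kind of random augmentation applied before embedding.
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 128))
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False  # the weights really are frozen

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
])

clip = torch.rand(3, 256, 256)  # stand-in for one video clip

with torch.no_grad():
    e1 = encoder(augment(clip).unsqueeze(0))  # "epoch 1" embedding
    e2 = encoder(augment(clip).unsqueeze(0))  # "epoch 2" embedding
print(torch.allclose(e1, e2))  # almost surely False: same video, different embeddings
```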

I think it'd be nice to see the performance of your models in such a setting. This is just a suggestion for something I believe others and I would find useful, but I understand if you can't do it for any reason.

(I assume the baselines use augmentations the same way; even so, my concern applies.)

MidoAssran commented 7 months ago

Hi @bryant1410, yes, the weights of the encoder are indeed frozen during evaluation, but data augmentations are used to train the probe on top of the frozen backbone. This is actually common practice for probing vision models (e.g., the popular IN1k linear probe).
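For concreteness, a minimal sketch of this protocol, with a frozen backbone, randomly augmented inputs, and a trainable probe (the toy encoder and data are hypothetical stand-ins, not this repo's code):

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins: a frozen "backbone" and a fake dataset of clip tensors.
embed_dim, num_classes = 128, 10
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 16 * 16, embed_dim))
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False  # backbone weights are frozen

probe = torch.nn.Linear(embed_dim, num_classes)  # only the probe is trained
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)

videos = torch.rand(64, 3, 16, 16)  # stand-in for (augmented) video clips
labels = torch.randint(0, num_classes, (64,))
loader = DataLoader(TensorDataset(videos, labels), batch_size=16)

for epoch in range(2):
    for x, y in loader:            # in practice, augmentations are re-sampled here
        with torch.no_grad():      # no gradients flow into the frozen encoder
            feats = encoder(x)
        loss = F.cross_entropy(probe(feats), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```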

You could also create a cached "dataset" by making several passes over the training set and computing embeddings for various augmentations of each video. This is probably closer to your suggestion, and it would be valuable for efficiently training different types of probes or exploring hyper-parameters. It's not a priority right now, but it's certainly something we can look at adding in the future!
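Continuing the toy stand-ins from the sketch above (again hypothetical, not an existing script in this repo), the caching could look like this:

```python
import torch

# Pre-extract embeddings for K augmented passes over the training set, then
# train any number of probes against the cache without touching the encoder.
K = 5
cached_feats, cached_labels = [], []

encoder.eval()
with torch.no_grad():
    for _ in range(K):           # each pass would re-sample the random augmentations
        for x, y in loader:
            cached_feats.append(encoder(x).cpu())
            cached_labels.append(y)

torch.save(
    {"feats": torch.cat(cached_feats), "labels": torch.cat(cached_labels)},
    "cached_embeddings.pt",
)
```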

bryant1410 commented 7 months ago

Oh, I didn't know this was common practice in some works (e.g., LG-SSL doesn't do this).

Feel free to close this issue or leave it open.