facebookresearch / ijepa

Official codebase for I-JEPA, the Image-based Joint-Embedding Predictive Architecture. First outlined in the CVPR paper, "Self-supervised learning from images with a joint-embedding predictive architecture."

Question about dinov2 vs ijepa #66

Open lanalex opened 5 days ago

lanalex commented 5 days ago

Hello,

Conceptually, both DINOv2 and I-JEPA provide latent-space representations of images. DINOv2 relies heavily on augmentations and multi-view generation, while I-JEPA does not. As far as I can see, the primary advantage of the pretrained DINOv2 weights is that they were trained on far more images. Why did Facebook choose to scale up DINOv2 rather than the I-JEPA architecture? Are there advantages to the former? And are there benchmarks comparing the two when trained on the same unsupervised pretraining set?
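
To make "latent representation" concrete, here is a rough sketch of how one might pull an image embedding from each model. The `dinov2_vitb14` torch.hub entrypoint is the one documented by the DINOv2 repo; on the I-JEPA side, the `vit_huge` constructor arguments, the checkpoint filename, and the `target_encoder` / `module.` state-dict keys are my assumptions about this codebase, not verified API.

```python
# Sketch: image-level embeddings from DINOv2 vs. I-JEPA (assumptions noted inline).
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
x = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)  # [1, 3, 224, 224]

# DINOv2: forward() returns a single image-level embedding (the CLS token).
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()
with torch.no_grad():
    z_dinov2 = dinov2(x)                                  # [1, 768]

# I-JEPA: the encoder outputs patch-level features with no CLS token, so an
# image-level representation is usually taken as the mean over patch tokens.
from src.models.vision_transformer import vit_huge        # assumed constructor name
encoder = vit_huge(patch_size=14).eval()                  # assumed arguments
ckpt = torch.load("IN1K-vit.h.14-300e.pth.tar", map_location="cpu")  # example checkpoint path
# assumed checkpoint layout: DDP-prefixed state dict under "target_encoder"
encoder.load_state_dict({k.replace("module.", ""): v
                         for k, v in ckpt["target_encoder"].items()})
with torch.no_grad():
    z_ijepa = encoder(x).mean(dim=1)                      # [1, N_patches, D] -> [1, D]
```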

Thank you!

bumi001 commented 3 days ago

In Table 4, page 12 of the DINOv2 paper, the top-1 accuracy for DINO v1 with a linear probe is 79.2. On the other hand, in Table 2 of the I-JEPA paper the top-1 accuracy for DINO v1 is 70. This discrepancy could possibly be due to the I-JEPA paper reporting results at 300 epochs and the DINOv2 paper reporting results at 500 epochs.

My question is the following. At 300 epochs, I-JEPA reaches a top-1 accuracy of 77.3, which is 7.3 points higher than DINO v1. If you ran I-JEPA for 500 epochs, would it still be ahead of DINO v1 by 7.3 points?

Another question: what happens if you train I-JEPA on the LVD-142M dataset for 500 epochs? Would it be superior to DINOv2?
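
For reference, the linear-probe numbers being compared above are top-1 accuracies of a single linear classifier trained on frozen features. Below is a minimal sketch of that protocol, with scikit-learn's `LogisticRegression` standing in for the papers' SGD/LARS probes; the feature arrays are assumed to come from a frozen I-JEPA or DINOv2 encoder.

```python
# Minimal linear-probe sketch: fit a linear classifier on frozen features and
# report top-1 accuracy. Both papers use heavier protocols (SGD/LARS probes,
# augmentation, longer schedules); this only illustrates the idea.
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe_top1(train_feats, train_labels, val_feats, val_labels):
    """train_feats/val_feats: [N, D] frozen embeddings; labels: [N] integer classes."""
    clf = LogisticRegression(max_iter=1000)        # stand-in for the papers' linear probe
    clf.fit(train_feats, train_labels)
    preds = clf.predict(val_feats)
    return float(np.mean(preds == val_labels))     # top-1 accuracy

# Usage (features extracted beforehand from a frozen encoder):
# acc = linear_probe_top1(f_train, y_train, f_val, y_val)
```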