lanalex opened 5 days ago
In Table 4 on page 12 of the DINOv2 paper, the top-1 accuracy for DINO (v1) using a linear probe is 79.2. On the other hand, in Table 2 of the I-JEPA paper, the top-1 accuracy for DINO (v1) is 70.0. This discrepancy could possibly be due to I-JEPA reporting results at 300 epochs while DINOv2 reports results at 500 epochs.
My question is the following: at 300 epochs, I-JEPA reaches a top-1 accuracy of 77.3, which is 7.3 percentage points higher than DINO (v1). If you ran I-JEPA for 500 epochs, would it still beat DINO (v1) by that margin?
Another question: what happens if you train I-JEPA on the LVD-142M dataset for 500 epochs? Would it be superior to DINOv2?
Hello,
Conceptually, both DINOv2 and I-JEPA produce latent-space representations of images. DINOv2 relies heavily on augmentation and the generation of multiple data views, while I-JEPA does not. As far as I can see, the primary advantage of the pretrained DINOv2 weights is that the model was trained on far more images. Why did Facebook choose to do this with the DINOv2 architecture and not with I-JEPA? Are there advantages to the former? Are there benchmarks comparing both using the same initial unsupervised training set?
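To make the contrast concrete, here is a minimal toy sketch (my own illustration, not code from either paper) of the two objectives: a DINO-style consistency loss between two *augmented views* of the same image versus an I-JEPA-style prediction of *masked-patch latents* with no augmentation. The linear "encoder" and all shapes are placeholder assumptions standing in for a ViT.

```python
# Toy contrast of the two self-supervised objectives (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))  # toy encoder weights, standing in for a ViT


def encode(x):
    """Toy encoder: a fixed linear map from inputs to latents."""
    return x @ W


# DINO-style: enforce agreement between two augmented views of one image.
x = rng.normal(size=(8,))
view_a = x + 0.1 * rng.normal(size=x.shape)  # augmentation 1 (e.g. crop/jitter)
view_b = x + 0.1 * rng.normal(size=x.shape)  # augmentation 2
dino_loss = np.mean((encode(view_a) - encode(view_b)) ** 2)

# I-JEPA-style: from context patches, predict latents of masked patches.
patches = rng.normal(size=(4, 8))            # 4 "patches" of one image
context, target = patches[:2], patches[2:]   # visible vs masked patches
pred = encode(context).mean(axis=0)          # crude "predictor" from context
jepa_loss = np.mean((pred - encode(target).mean(axis=0)) ** 2)

print(dino_loss, jepa_loss)
```

The point of the sketch is only that the DINO-style loss needs handcrafted augmentations to define its positive pairs, while the JEPA-style loss gets its learning signal from masking alone, which is part of why the augmentation-free design is appealing.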
Thank you!