Closed lloo099 closed 1 year ago
The cifar100 task in the VTAB-1k benchmark only uses 1,000 training images, while the AdaptFormer paper reports results on the full CIFAR-100 dataset with 50,000 training images. We also provide results on full CIFAR-100 in Table 5 and dataset details in Table 9.
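To illustrate the difference in scale, here is a minimal sketch of drawing a 1,000-example subset by index from the full training set. This is illustrative only: the actual VTAB-1k splits are fixed by the benchmark itself, not re-sampled like this.

```python
import random

TRAIN_SIZE = 50_000   # full CIFAR-100 training set
SUBSET_SIZE = 1_000   # VTAB-1k training budget

rng = random.Random(0)  # fixed seed so the subset is reproducible
subset_indices = sorted(rng.sample(range(TRAIN_SIZE), SUBSET_SIZE))

# 1,000 unique indices, all within the full training set
print(len(subset_indices), len(set(subset_indices)))  # 1000 1000
```

With only 2% of the training data available, a gap versus full-dataset fine-tuning results is expected.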
Thanks for the prompt reply. This is an interesting topic. I also see that the full CIFAR-100 result in Table 5 reaches 93.95. I previously replicated the AdaptFormer training for 100 epochs with a batch size of 128, but the result was only around 85.8. What could be the reason for this gap?
In Table 9, you mention 60,000 training images and 10,000 test images, but according to the original dataset there are 50,000 training images and 10,000 test images. The CIFAR-100 dataset consists of 60,000 images divided into 100 classes; each class contains 600 images, split into 500 for training and 100 for testing. https://www.cs.toronto.edu/~kriz/cifar.html
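The split sizes above follow directly from the per-class counts on the official dataset page; a quick sanity check:

```python
# CIFAR-100: 100 classes, each with 500 training and 100 test images
NUM_CLASSES = 100
TRAIN_PER_CLASS = 500
TEST_PER_CLASS = 100

train_total = NUM_CLASSES * TRAIN_PER_CLASS   # training images
test_total = NUM_CLASSES * TEST_PER_CLASS     # test images
dataset_total = train_total + test_total      # dataset total

print(train_total, test_total, dataset_total)  # 50000 10000 60000
```

So 60,000 is the size of the whole dataset, not of the training split.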
According to the original AdaptFormer article, the result of fine-tuning a supervised pre-trained ViT-B on full CIFAR-100 can reach 91.86 (Table 6 in https://arxiv.org/pdf/2205.13535.pdf).
'60,000 training images' is a typo. It should be 50,000. Thank you for pointing this out.
I see, thanks for your answer!
I'm curious why your CIFAR-100 result is less accurate than the one reported in the original article. As you explain in your paper, you used a ViT model and fine-tuned it on the downstream dataset. The AdaptFormer in your paper achieves 73.8 on CIFAR-100, whereas the original article's AdaptFormer reaches up to 83.52 at h=1. Why does your training setup perform so much worse? Thanks