bfshi / scaling_on_scales

When do we not need larger vision models?

Question about the ViT baseline for ImageNet classification #11

Closed mingkai-zheng closed 1 month ago

mingkai-zheng commented 1 month ago

Thank you for the nice work. I have a few questions regarding the ViT baseline for the ImageNet classification results presented in Table 5. The table shows that the ViT-Base model achieves 80.3% top-1 accuracy, and I'm wondering where this baseline comes from. I was under the impression that the 81.8% top-1 accuracy from the DeiT paper is more commonly referenced.

Additionally, the MAE paper (Table 3) reports that the Large and Huge models achieve 82.6% and 83.1% top-1 accuracy (without pretraining), respectively. However, Table 5 in this paper shows 81.6 and 77.3 for these models. Am I missing something here?

bfshi commented 1 month ago

Hi @mingkai-zheng, thanks for your interest in our work. For the ViT ImageNet classification experiments, we use Google's pre-trained checkpoints from the original ViT paper:

- https://huggingface.co/google/vit-base-patch16-224-in21k
- https://huggingface.co/google/vit-large-patch16-224-in21k
- https://huggingface.co/google/vit-huge-patch14-224-in21k

These may perform differently from the DeiT checkpoints. Under our experimental setting, these three models give 80.3%, 81.6%, and 77.3% linear-probe accuracy, respectively.
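For anyone trying to reproduce this, a minimal sketch of loading one of those checkpoints and taking the CLS-token feature for a linear probe is below. It uses the Hugging Face `transformers` ViTModel API; the paper's actual training pipeline, preprocessing, and probe hyperparameters are not shown.

```python
# Minimal sketch (not the authors' exact pipeline): load one of the Google
# ViT checkpoints and extract the frozen CLS-token feature for a linear probe.
import torch
from transformers import ViTModel

model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k").eval()

# Placeholder batch; in practice this comes from the ImageNet dataloader
# after the usual resize/normalize preprocessing.
images = torch.rand(2, 3, 224, 224)

with torch.no_grad():
    features = model(pixel_values=images).last_hidden_state  # [2, 197, 768]
cls_tokens = features[:, 0]                                  # [2, 768] CLS token

# Linear probe: a single linear layer trained on top of the frozen CLS feature.
probe = torch.nn.Linear(cls_tokens.shape[-1], 1000)          # 1000 ImageNet classes
logits = probe(cls_tokens)
```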

mingkai-zheng commented 1 month ago

Thank you very much for your response! I have an additional question regarding the classification experiment: Is the final linear layer directly trained on top of the [16, 16, 1536] feature map? Did you use the CLS token, or did you apply average pooling on the feature map?

bfshi commented 1 month ago

Yes, we used the CLS token for classification. When dealing with multi-scale features, at each scale we take the average of the CLS tokens of the sub-images as the CLS token for that scale, and then concatenate the CLS tokens from all scales as the final CLS token. For example, for an image with two scales, 224x224 and 448x448: at the 224x224 scale the ViT gives one CLS token. At the 448x448 scale, the image is split into four 224x224 sub-images; the ViT processes each sub-image and gives a CLS token, and we take the average of these four CLS tokens as the CLS token for the 448x448 scale. Finally, we concatenate the CLS tokens from the 224x224 and 448x448 scales.
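In code, that aggregation might look roughly like the sketch below. This is an illustrative reimplementation, not the repo's actual API: `vit` is assumed to be any callable that maps a batch of 224x224 images to CLS tokens, and the 448x448 scale is assumed to be obtained by upsampling the base image.

```python
# Hedged sketch of the two-scale CLS aggregation described above.
# `vit` and `multiscale_cls` are illustrative names, not the repo's API.
import torch
import torch.nn.functional as F

def multiscale_cls(vit, image_224):
    """image_224: [B, 3, 224, 224] batch at the base scale."""
    B = image_224.shape[0]

    # Scale 1 (224x224): the ViT's CLS token on the original image.
    cls_224 = vit(image_224)                                      # [B, D]

    # Scale 2 (448x448): upsample, split into four 224x224 sub-images,
    # run the ViT on each, and average the four CLS tokens.
    image_448 = F.interpolate(image_224, size=448, mode="bilinear",
                              align_corners=False)
    subs = image_448.unfold(2, 224, 224).unfold(3, 224, 224)      # [B, 3, 2, 2, 224, 224]
    subs = subs.permute(0, 2, 3, 1, 4, 5).reshape(B * 4, 3, 224, 224)
    cls_448 = vit(subs).reshape(B, 4, -1).mean(dim=1)             # [B, D]

    # Final feature: concatenate the per-scale CLS tokens -> [B, 2*D],
    # which is what the linear classifier is trained on.
    return torch.cat([cls_224, cls_448], dim=-1)
```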

mingkai-zheng commented 1 month ago

Thank you so much for your detailed response.