feiyang-cai / osr_vit


What's the OSR performance on TinyImageNet without Pre-training of ViT? #1

Closed. Cogito2012 closed this issue 2 years ago.

Cogito2012 commented 2 years ago

Hi Feiyang,

Thanks for sharing this interesting work! I have some questions regarding this work.

As pointed out in the paper, the large OSR performance gain on TinyImageNet potentially results from pre-training ViT on ImageNet-21K, which actually contains all of the closed- and open-set classes in TinyImageNet. It may therefore be somewhat misleading to attribute this result to the intrinsic power of ViT for the OSR problem, even though we do believe ViT should work.

It would be better to see the actual OSR performance when the ViT is trained from scratch on the closed-set data only. It would also be interesting to discuss the fundamental reasons why ViT works for OSR, since the recent ICLR'22 work (reference [28] in the paper) has already empirically validated the strong performance of ViT on image OSR tasks.
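For concreteness, a from-scratch baseline might look something like the sketch below. This is only a rough illustration using the `vit_pytorch` package (lucidrains/vit-pytorch); the model size, closed-set split, and training schedule are placeholders I made up, not your paper's configuration:

```python
from torch import nn, optim
from vit_pytorch import ViT  # randomly initialized ViT, no pretraining

# Hypothetical closed-set split, e.g. 20 of TinyImageNet's 200 classes as "known".
NUM_KNOWN_CLASSES = 20

model = ViT(
    image_size=64,                 # TinyImageNet images are 64x64
    patch_size=8,
    num_classes=NUM_KNOWN_CLASSES, # classifier over known classes only
    dim=384,
    depth=6,
    heads=6,
    mlp_dim=768,
    dropout=0.1,
    emb_dropout=0.1,
)

criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)

def train_one_epoch(loader):
    """Standard supervised training on the closed-set classes only."""
    model.train()
    for images, labels in loader:   # labels lie in [0, NUM_KNOWN_CLASSES)
        logits = model(images)
        loss = criterion(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```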

Looking forward to discussing this with you further :)

feiyang-cai commented 2 years ago

Hi,

Thanks for your interest in our work and for your questions.

Regarding your question, let me address it from three angles:

  1. The pretraining by itself does not bring considerable benefits to OSR performance; we analyze this in Sec. 4.5.2 of our paper. In the CIFAR-10 experiment, if we directly use the pretrained ViT model (the CIFAR-10 classes are also included in the pretraining dataset ImageNet-21K) to perform OSR, the AUROC is 87.40%, which is much lower than our final result of 99.5%. (A generic sketch of how such an open-set AUROC can be computed is shown after this list.)

  2. It is not that we do not want to train from scratch; rather, pretraining is practically necessary for training ViT. You can see these two links: https://github.com/lucidrains/vit-pytorch/issues/12 and https://github.com/kentaroy47/vision-transformers-cifar10. When ViT is trained from scratch without pretraining, it does not train well on CIFAR-10. Pretraining is the default setting for methods that use ViT, and I believe the ICLR'22 paper you mentioned also used a pretrained model for its experiments on ImageNet-21K-P. Even in the original paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", the comparison with other SOTA results uses pretrained models for classification on downstream datasets such as CIFAR-10. Also, pretraining on ImageNet-21K is a standard and general procedure; we did not intentionally choose a special dataset to train our model in order to induce good performance.

  3. I believe there are some ViT variants that do not require pretraining, and it would be very interesting to explore the OSR performance of these variants without pretraining.
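For reference, here is a rough sketch of how an open-set AUROC like the one quoted in point 1 can be computed. This is a generic example that uses the maximum softmax probability as the known-vs-unknown score; the scoring rule in our code may differ, and the model and loader names are placeholders:

```python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score

@torch.no_grad()
def osr_auroc(model, known_loader, unknown_loader):
    """Generic open-set AUROC with the max softmax probability as the score.

    Closed-set (known) samples should receive high scores,
    open-set (unknown) samples low scores.
    """
    model.eval()
    scores, is_known = [], []
    for loader, label in ((known_loader, 1), (unknown_loader, 0)):
        for images, _ in loader:
            probs = F.softmax(model(images), dim=-1)
            scores.append(probs.max(dim=-1).values.cpu().numpy())
            is_known.append(np.full(len(images), label))
    return roc_auc_score(np.concatenate(is_known), np.concatenate(scores))
```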

Thanks again for your questions, and feel free to continue the discussion or ask anything else about our work.

Best, Feiyang