232525 / PureT

Implementation of 'End-to-End Transformer Based Model for Image Captioning' [AAAI 2022]

Swin Transformer pre-trained Model? #1

Open JingyuLi-code opened 2 years ago

JingyuLi-code commented 2 years ago

Thanks for your work! I want to know which pre-trained model you use: the one trained on ImageNet-1K, or the one pre-trained on ImageNet-22K and fine-tuned on ImageNet-1K? Have you compared the two? It would also be better if the source code were released.

232525 commented 2 years ago
  1. Pre-trained model: the latter. We adopted the Swin-L 1K model (input size 384x384, window size 12) from the "ImageNet-22K pre-trained models" table in https://github.com/microsoft/Swin-Transformer/blob/main/get_started.md. Actually, we have not compared the influence of different Swin backbones. A loading sketch is shown after this list.

  2. code release: As soon as possible. Maybe I need to ask my supervisor.
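
For concreteness, here is a minimal sketch (not the authors' code) of loading that checkpoint as a feature backbone with the timm library; the model name `swin_large_patch4_window12_384` is timm's identifier for the ImageNet-22K pre-trained, ImageNet-1K fine-tuned Swin-L variant, assuming a timm version that registers this alias.

```python
# Minimal sketch of loading the Swin-L (384x384, window 12) checkpoint
# discussed above as a feature extractor via timm; not the PureT code.
import timm
import torch

backbone = timm.create_model(
    "swin_large_patch4_window12_384",  # ImageNet-22K pre-trained, 1K fine-tuned
    pretrained=True,                   # downloads the checkpoint
    num_classes=0,                     # strip the classification head
)
backbone.eval()

# Run a dummy 384x384 batch to inspect the grid features a captioning
# decoder would consume.
with torch.no_grad():
    feats = backbone.forward_features(torch.randn(1, 3, 384, 384))
print(feats.shape)
```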

JingyuLi-code commented 2 years ago

Thanks! Actually, I want to know the results of using the "Regular ImageNet-1K trained models" versus the "ImageNet-22K pre-trained models", because the region features and grid features are extracted from backbones pre-trained on regular ImageNet-1K. Since ImageNet-22K contains far more images than ImageNet-1K, will that cause a huge difference?

232525 commented 2 years ago

The difference must exist, but whether it is huge needs experimental verification. It seems that the Swin Transformer repo did not release a Swin-L model trained only on regular ImageNet-1K. I am running a simple experiment that trains our PureT with a Swin-B backbone (input size 384x384, window size 12) pre-trained on regular ImageNet-1K; the results may take a couple of days.
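
Swapping in a different checkpoint typically amounts to something like the following hedged sketch; the filename and the nesting of weights under a "model" key follow the official Swin-Transformer releases, but they are assumptions here, not PureT's actual loader.

```python
# Hypothetical sketch of loading a locally downloaded Swin-B
# ImageNet-1K checkpoint; the filename and key layout are assumptions
# based on the official Swin-Transformer releases.
import timm
import torch

backbone = timm.create_model(
    "swin_base_patch4_window12_384",  # Swin-B, 384x384 input, window 12
    pretrained=False,
    num_classes=0,                    # no classification head needed
)

ckpt = torch.load("swin_base_patch4_window12_384.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # official ckpts nest weights under "model"

# Drop the classifier weights, since num_classes=0 removed the head.
state_dict = {k: v for k, v in state_dict.items() if not k.startswith("head.")}

missing, unexpected = backbone.load_state_dict(state_dict, strict=False)
print(f"missing: {len(missing)}, unexpected: {len(unexpected)}")
```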

JingyuLi-code commented 2 years ago

Hi, thanks for your reply! I am curious about the result of using the Swin-B backbone pre-trained on regular ImageNet-1K.

232525 commented 2 years ago

The results are bad, even worse than using Bottom-Up region features under XE loss, so I did not continue training it under SCST. Abnormal! I guess there may be some mistakes, or the training process may need some modification, but I do not have enough free time for this right now (I will try when I am free later). I have released the code, so maybe you can re-train it yourself. I have also trained the model using the ImageNet-22K Swin-B, and the results are normal (B1: 81.3, B2: 66.3, B3: 52.0, B4: 39.9, M: 29.9, R: 59.6, C: 136.6, S: 23.8).
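
For context, the SCST stage mentioned above refers to self-critical sequence training (Rennie et al., 2017), which follows the XE (cross-entropy) stage. A generic sketch of its loss, not PureT's exact implementation, looks like this:

```python
# Minimal sketch of the SCST objective: REINFORCE with the greedy
# decode's reward (e.g. CIDEr) as the baseline. Generic illustration,
# not PureT's training code.
import torch

def scst_loss(sample_logprobs, sample_reward, greedy_reward):
    """sample_logprobs: (B,) summed log-probs of the sampled captions
    sample_reward:   (B,) reward (e.g. CIDEr) of the sampled captions
    greedy_reward:   (B,) reward of the greedy-decoded captions
    """
    advantage = sample_reward - greedy_reward
    # Gradients flow only through the log-probs, not the rewards.
    return -(advantage.detach() * sample_logprobs).mean()
```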