clip-vil / CLIP-ViL

[ICLR 2022] Code for "How Much Can CLIP Benefit Vision-and-Language Tasks?" (https://arxiv.org/abs/2107.06383)
MIT License

CLIP-VIT-B-Transformer captioning results #20

Closed · YuanEZhou closed this issue 2 years ago

YuanEZhou commented 2 years ago

Hi @sIncerass, thanks for your interesting work. I have a question about the image captioning results in Table 2. I find that the Transformer model with CLIP-ViT-B features can still achieve good performance, rather than the dramatically worse performance reported there. Perhaps there is a bug in the CLIP-ViT-B feature extraction.

sIncerass commented 2 years ago

Hi @YuanEZhou, thanks for pointing this out. Yes, that was due to a resizing bug, which we have fixed in this repo. We will update the manuscript accordingly soon.
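
For anyone reproducing the feature extraction, below is a minimal sketch of encoding an image with CLIP ViT-B using the preprocessing transform that ships with the openai/CLIP package, so no hand-rolled resizing is involved. The model variant ("ViT-B/32"), the image path, and the use of the pooled embedding are illustrative assumptions; the repo's own extraction scripts define the exact features used for captioning.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# clip.load returns the model together with its matching preprocessing
# transform (bicubic resize to the model's input resolution, center crop,
# and CLIP's normalization), which avoids custom resizing code.
model, preprocess = clip.load("ViT-B/32", device=device)

# "example.jpg" is a placeholder path.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    features = model.encode_image(image)

print(features.shape)  # pooled image embedding, e.g. torch.Size([1, 512]) for ViT-B/32
```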