Hi @sIncerass, thanks for your interesting work. I have a question about the image captioning results in Table 2. I find that the Transformer model with CLIP-ViT-B features still performs well, rather than dramatically worse as reported in Table 2. Maybe there is a bug in the CLIP-ViT-B feature extraction?
Hi @YuanEZhou, thanks for pointing this out. Yes, that is due to the resizing bug we fixed in this repo. We will update the manuscript accordingly soon.