MikeWangWZHL / VidIL

PyTorch code for "Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners"
MIT License

QA about the text-video retrieval result in the paper. #7

Closed. RERV closed this issue 2 years ago.

RERV commented 2 years ago

Hi, thanks for your great work! I have a question about the zero-shot text-video retrieval results in the paper. As shown in the first line of Table 5, the t2v R@1 is 40.5 and R@5 is 62.8, while in the original BLIP paper, t2v R@1 is 43.3 and R@5 is 65.6. I think there must be some difference, e.g., in the backbone. Could you tell me why the numbers are different? Thanks!

MikeWangWZHL commented 2 years ago

Hi, thanks for the question. Note that the scores shown in Table 10 of the BLIP paper come from the BLIP model further fine-tuned on COCO retrieval, whereas in our Table 5 the BLIP model is the pre-trained checkpoint. As mentioned earlier in our paper, we use "BLIP_cap" to denote the BLIP checkpoint further fine-tuned on COCO captioning and "BLIP_vqa" to denote the checkpoint further fine-tuned on VQA; otherwise, "BLIP" refers to the original pre-trained checkpoint.
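
To make the distinction concrete, here is a minimal sketch of loading the two checkpoints with the `blip_retrieval` loader from the BLIP codebase; the checkpoint paths are placeholders rather than the exact files used for the paper, and the script assumes it is run from the BLIP repo root so the default `med_config` path resolves:

```python
from models.blip_retrieval import blip_retrieval

# Pre-trained BLIP checkpoint -- what "BLIP" denotes in VidIL's Table 5.
# Path is a placeholder for the downloaded pre-trained weights.
blip_pretrained = blip_retrieval(
    pretrained='checkpoints/model_base.pth',
    image_size=384, vit='base')

# BLIP further fine-tuned on COCO retrieval -- what Table 10 of the BLIP paper reports.
# Path is a placeholder for the COCO-retrieval fine-tuned weights.
blip_coco_retrieval = blip_retrieval(
    pretrained='checkpoints/model_base_retrieval_coco.pth',
    image_size=384, vit='base')
```

Evaluating these two models on the same retrieval benchmark gives different scores, which accounts for the gap you observed.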