Closed: RERV closed this issue 2 years ago
Hi, thanks for your great work! I have a question about the zero-shot text-video retrieval results in the paper. As shown in the first line of Table 5, t2v R@1 is 40.5 and R@5 is 62.8, while in the original BLIP paper, t2v R@1 is 43.3 and R@5 is 65.6. There must be some difference, e.g., in the backbone. Could you tell me why the numbers are different? Thanks!

Hi, thanks for the question. Note that the scores shown in Table 10 of the BLIP paper are from the BLIP model further finetuned on COCO-retrieval, whereas in our Table 5 the BLIP model is the pre-trained checkpoint. As mentioned earlier in our paper, we use "BLIP_cap" to denote the BLIP checkpoint further finetuned on COCO-captioning and "BLIP_vqa" to denote the checkpoint further finetuned on VQA; otherwise "BLIP" refers to the original pre-trained checkpoint.
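For readers comparing such numbers: here is a minimal sketch of how text-to-video Recall@K (the t2v R@1 / R@5 metric discussed above) is typically computed from a text-video similarity matrix. The function name and the toy similarity matrix are illustrative assumptions, not code from either paper; the ground-truth pairing is assumed to lie on the diagonal.

```python
import numpy as np

def t2v_recall_at_k(sim: np.ndarray, k: int) -> float:
    """Text-to-video Recall@K, in percent.

    sim[i, j] is the similarity between text query i and video j;
    the ground-truth match for query i is assumed to be video i.
    """
    # Sort candidate videos by descending similarity for each text query.
    ranked = (-sim).argsort(axis=1)
    gt = np.arange(sim.shape[0])[:, None]
    # A query is a "hit" if its ground-truth video appears in the top-K.
    hits = (ranked[:, :k] == gt).any(axis=1)
    return float(hits.mean() * 100)

# Toy example: query 2's top-1 video is video 1, so R@1 = 2/3 = 66.7%.
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.8, 0.1],
                [0.2, 0.7, 0.4]])
print(round(t2v_recall_at_k(sim, 1), 1))  # → 66.7
print(round(t2v_recall_at_k(sim, 2), 1))  # → 100.0
```

The metric itself is identical across the two papers; as noted in the answer, the gap in Table 5 vs. BLIP's Table 10 comes from evaluating different checkpoints (pre-trained vs. COCO-retrieval-finetuned), not from a different R@K definition.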