MikeWangWZHL / VidIL

PyTorch code for "Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners"

Computing BLIP baseline for 4 metrics on MSRVTT caption #4

Closed AnaRhisT94 closed 2 years ago

AnaRhisT94 commented 2 years ago

Hi, according to the reported baselines of BLIP and BLIP_cap: [screenshot of the reported results]

I'm trying to understand, and also find in the code, how you computed this baseline on the four metrics. According to Section 4.2 of the paper, you stitch multiple frames together and compute the loss, but I'm not sure I understand how this is done (or where it is implemented in the code).

Any help is highly appreciated!

Thanks a lot.

AnaRhisT94 commented 2 years ago

After testing the code in run_video_CapFilt.py I got the following results for BLIP_cap: 21.46, 0.476, 0.222, 0.289. I'm now trying to find the BLIP code and see how to reproduce the reported results of ~39.5 CIDEr and ~27.7 BLEU-4.
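For context, the four metrics are presumably the standard captioning set (BLEU-4, METEOR, ROUGE-L, CIDEr). Below is a minimal sketch of how such scores can be computed with the pycocoevalcap package; the dict format, file handling, and the choice of pycocoevalcap are my assumptions, and the repo's own evaluation code may differ.

```python
# Hedged sketch: scoring generated MSRVTT captions with the four standard
# captioning metrics via pycocoevalcap. The input format (video_id -> list of
# caption strings) is an assumption, not the repo's actual evaluation pipeline.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor   # requires Java on the PATH
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider


def score_captions(gts, res):
    """gts/res: dict mapping video_id -> list of (tokenized) caption strings."""
    scorers = [
        (Bleu(4), ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4"]),
        (Meteor(), "METEOR"),
        (Rouge(), "ROUGE-L"),
        (Cider(), "CIDEr"),
    ]
    results = {}
    for scorer, name in scorers:
        score, _ = scorer.compute_score(gts, res)
        if isinstance(name, list):  # Bleu returns one score per n-gram order
            results.update(dict(zip(name, score)))
        else:
            results[name] = score
    return results


# Toy usage (real evaluation would load the MSRVTT test annotations):
gts = {"video0": ["a man is playing a guitar", "a person plays guitar"]}
res = {"video0": ["a man plays a guitar"]}
print(score_captions(gts, res))
```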

MikeWangWZHL commented 2 years ago

The code in run_video_CapFilt.py is for generating frame captions, not for this video captioning baseline. You can find the code for training the few-shot BLIP baseline for video captioning at https://github.com/MikeWangWZHL/VidIL/blob/main/train_caption_video.py; there is also an example train_caption_video.sh script in scripts/. It is modified from the original BLIP video eval code here. As mentioned in our paper, instead of only doing evaluation, we further train BLIP on the concatenated frame features from the few-shot training samples.
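A hedged sketch of the idea, assuming BLIP's usual visual_encoder / text_decoder structure (the actual train_caption_video.py may organize this differently): sample N frames per video, encode each frame with the image encoder, concatenate the per-frame visual tokens along the sequence dimension, and train the caption decoder with a language-modeling loss conditioned on that concatenated sequence. Function and variable names below are illustrative only.

```python
# Illustrative sketch, not the actual train_caption_video.py implementation.
import torch


def encode_video(blip_model, frames):
    """frames: tensor of shape (B, N, 3, H, W) holding N sampled frames per video."""
    B, N = frames.shape[:2]
    frames = frames.reshape(B * N, *frames.shape[2:])      # fold frames into batch
    frame_embeds = blip_model.visual_encoder(frames)       # (B*N, L, D) patch tokens
    L, D = frame_embeds.shape[1:]
    video_embeds = frame_embeds.reshape(B, N * L, D)       # concatenate along tokens
    video_atts = torch.ones(video_embeds.shape[:2], dtype=torch.long,
                            device=video_embeds.device)
    return video_embeds, video_atts


def caption_loss(blip_model, tokenizer, frames, captions, device):
    """LM loss of the BLIP text decoder conditioned on the concatenated video tokens.
    (BLIP's prompt/BOS handling is omitted here for brevity.)"""
    video_embeds, video_atts = encode_video(blip_model, frames)
    text = tokenizer(captions, padding="longest", return_tensors="pt").to(device)
    targets = text.input_ids.masked_fill(
        text.input_ids == tokenizer.pad_token_id, -100)    # ignore padding in the loss
    output = blip_model.text_decoder(
        text.input_ids,
        attention_mask=text.attention_mask,
        encoder_hidden_states=video_embeds,
        encoder_attention_mask=video_atts,
        labels=targets,
        return_dict=True,
    )
    return output.loss
```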
