OpenGVLab / InternVideo

[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
Apache License 2.0

Performance Reproduction of ViCLIP (on MSRVTT) #106

Closed jpWang closed 5 months ago

jpWang commented 5 months ago

Hi, thanks for the great work(s) and this great repo~

I have a (maybe very beginner) question about reproducing ViCLIP's zero-shot performance on MSRVTT.

Based on my understanding, I take the video-text pairs in MSRVTT_JSFUSION_test.csv, compute all the video_features and text_features, and then directly use the following code to calculate the text-to-video retrieval top-1 accuracy:

score = text_features @ video_features.T
pred = score.argmax(-1)
accuracy = (pred == torch.arange(len(pred))).sum() / len(pred)
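For reference, retrieval papers usually report recall@k over L2-normalized features rather than a bare argmax accuracy. A minimal sketch of that evaluation (`retrieval_recall` is a hypothetical helper, not from the repo; it assumes one ground-truth video per text, at the matching index):

```python
import torch

def retrieval_recall(text_features, video_features, ks=(1, 5, 10)):
    # L2-normalize so the dot product is cosine similarity; CLIP-style
    # models typically normalize features before retrieval.
    t = text_features / text_features.norm(dim=-1, keepdim=True)
    v = video_features / video_features.norm(dim=-1, keepdim=True)
    score = t @ v.T  # (num_texts, num_videos)
    # Rank of the ground-truth video (index i) for each text query i.
    target = torch.arange(len(score)).unsqueeze(1)
    ranks = (score.argsort(dim=-1, descending=True) == target).nonzero()[:, 1]
    return {f"R@{k}": (ranks < k).float().mean().item() for k in ks}
```

R@1 from this sketch matches the argmax accuracy above when features are already normalized; if they are not, the two can differ.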

However, I only get an accuracy of 0.3770, which is far below the 0.4240 reported in the paper. Have I overlooked any important details or misunderstood anything?

Looking forward to your reply and guidance~

jpWang commented 5 months ago

I also use the method shown in the demo to extract video frames:

import cv2
import numpy as np
import torch

# _frame_from_video and normalize come from the ViCLIP demo code.
video = cv2.VideoCapture(video_path)
frames = [x for x in _frame_from_video(video)]
fnum = 8
step = len(frames) // fnum  # note: 0 if the clip has fewer than 8 frames
frames = frames[::step][:fnum]

vid_tube = []
for q in frames:
    now = q[:, :, ::-1]                # BGR -> RGB
    now = cv2.resize(now, (224, 224))
    now = np.expand_dims(normalize(now), axis=(0, 1))  # (1, 1, H, W, C)
    vid_tube.append(now)
vid_tube = np.concatenate(vid_tube, axis=1)         # stack along time
vid_tube = np.transpose(vid_tube, (0, 1, 4, 2, 3))  # (1, T, C, H, W)
vid_tube = torch.from_numpy(vid_tube)
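One thing worth noting about the stride-based sampling above: the step becomes 0 when a clip has fewer than 8 frames, and the tail of longer clips can be undersampled. A small alternative sketch (not the demo's method, just an assumption-free uniform sampler):

```python
import numpy as np

def sample_frame_indices(num_frames, fnum=8):
    # Evenly spaced indices over the whole clip; unlike stride sampling
    # with len(frames) // fnum, this never produces a zero step for
    # short clips and always includes the last frame.
    return np.linspace(0, num_frames - 1, fnum).round().astype(int)
```

Differences in frame sampling alone can shift retrieval numbers by a point or two, so it is one of the details worth aligning with the official evaluation.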
siyilingting commented 3 months ago


Hi, I have a similar problem. Did you reproduce the ViCLIP performance on MSR-VTT?

jpWang commented 3 months ago

@siyilingting Hi, I used https://github.com/OpenGVLab/unmasked_teacher for evaluation and was finally able to get similar results for ViCLIP.

siyilingting commented 3 months ago

Thank you.