PKU-YuanGroup / Video-LLaVA

【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
https://arxiv.org/pdf/2311.10122.pdf
Apache License 2.0
3.04k stars, 220 forks

Cannot reproduce Zero-shot Video-QA (MSVD) #34

Closed: dcahn12 closed this issue 12 months ago

dcahn12 commented 12 months ago

Thanks for your contribution!

I tried to reproduce your zero-shot Video-QA result on the MSVD dataset using the pretrained weights at https://huggingface.co/LanguageBind/Video-LLaVA-7B/tree/main.

But the result is completely different from the one reported in your paper. (The reproduced result was attached as a screenshot: 2023-12-07_20-16-35)

Can you check this?

xmy0916 commented 12 months ago

@dcahn12 How about the other video benchmarks?

LinB203 commented 12 months ago

Refer to this issue.

dcahn12 commented 12 months ago

@xmy0916 I only tested MSVD Video-QA.

xmy0916 commented 10 months ago

My test results on MSVD:

Yes count: 4041
No count: 9116
Accuracy: 0.30713688530820094
Average score: 2.726077373261382
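
For anyone checking the numbers above, the reported accuracy is consistent with simply dividing the "yes" count by the total number of judged answers. The snippet below is a minimal sketch of that arithmetic only; the function name `accuracy_from_counts` is hypothetical and is not taken from the repository's evaluation scripts. The average score cannot be recomputed this way because the per-answer scores are not given in this thread.

```python
def accuracy_from_counts(yes_count: int, no_count: int) -> float:
    """Fraction of judged answers marked 'yes' (hypothetical helper)."""
    total = yes_count + no_count
    if total == 0:
        raise ValueError("no judged answers to compute accuracy from")
    return yes_count / total


# Counts copied from the MSVD results posted in this thread.
yes_count, no_count = 4041, 9116
acc = accuracy_from_counts(yes_count, no_count)

print(f"Total judged: {yes_count + no_count}")  # 13157
print(f"Accuracy: {acc:.4f}")                   # 0.3071
```

So the 0.307 figure follows directly from the posted counts (4041 of 13157 answers judged "yes").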