dvlab-research / LLaMA-VID

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024)
Apache License 2.0

Cannot reproduce Zero-shot Video-QA (MSVD) #6

Closed dcahn12 closed 9 months ago

dcahn12 commented 10 months ago

Thanks for your contribution!

I tried to reproduce your result (Zero-shot VideoQA on MSVD dataset) with the given pretrained weights. (EVA-G & LLaVA1.5-VideoChatGPT-Instruct 7B).

But the result is completely different from your paper. (Reproduced results are shown in the attached screenshot.)

[screenshot: reproduced MSVD-QA results]

Can you check this?

yanwei-li commented 10 months ago

Hi, that seems like a huge gap. Could you please give more details? We use this model and this script for evaluation.

dcahn12 commented 10 months ago

I also used the same model (llama-vid-7b-full-224-video-fps-1) and the same script to generate responses to the questions. However, when running the GPT-based evaluation, I simply omitted the "--api_base" argument because it caused an error during evaluation. Could you please provide the model-generated answers for the MSVD-QA dataset?
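For what it's worth, if the script treats --api_base as an optional argument, omitting it usually just leaves the OpenAI endpoint at the library's default. A minimal sketch of that argument handling (hypothetical, not the repo's actual evaluation script):

```python
import argparse

# Hypothetical sketch of how an eval script might treat --api_base as optional;
# this is not the actual LLaMA-VID evaluation code.
parser = argparse.ArgumentParser()
parser.add_argument("--api_key", required=True)
parser.add_argument("--api_base", default=None)  # omit to fall back to the provider default

args = parser.parse_args(["--api_key", "sk-placeholder"])
print(args.api_base)  # None when --api_base is omitted
```

If the flag defaults this way, dropping it should not change scores as long as the default endpoint is the one intended.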

yanwei-li commented 9 months ago

Hi, we provide the predictions in pred.json and the GPT-3.5 evaluated results in results.json. We also re-evaluated the model, because GPT-based evaluation can be biased (it gives different results on each run), but the scores remain within an acceptable range:

[screenshot: re-evaluated MSVD-QA results]
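The run-to-run variance point can be made concrete: averaging a few GPT-evaluation runs and checking their spread is a quick sanity check. A sketch with made-up accuracy values (not scores from the paper or this issue):

```python
# Illustrative sketch: GPT-based scoring varies run to run, so averaging a few
# runs and checking the spread shows whether a gap is just evaluator noise.
# The accuracy values below are made up, not results from the paper.
runs = [69.2, 69.9, 69.5]  # accuracy (%) from repeated GPT evaluations

mean = sum(runs) / len(runs)
spread = max(runs) - min(runs)

print(f"mean={mean:.2f}%, spread={spread:.2f} points")
```

A sub-point spread is ordinary evaluator noise; a multi-point gap like the one reported here suggests a setup difference rather than GPT variance.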
yanwei-li commented 9 months ago

Close it now, please reopen it if you have further questions.

Felix0805 commented 9 months ago

Hello, can you tell me where I can download the MSVD dataset? Thank you very much.