PKU-YuanGroup / Video-LLaVA

【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
https://arxiv.org/pdf/2311.10122.pdf
Apache License 2.0

Cannot reproduce the results on MSVD-QA and TGIF-QA #197

Open Jingchensun opened 3 weeks ago

Jingchensun commented 3 weeks ago

First, thank you for the amazing work.

I am using the LanguageBind/Video-LLaVA-7B checkpoint with do_sample=False and temperature=0.0 set in run_inference_video_qa.py. Inference on the MSVD-QA dataset took about one hour on 4× A6000 GPUs. However, when I evaluated the predictions with GPT-3.5 (default settings), I only achieved an accuracy of 36.27% and a score of 2.87, which is significantly lower than the results reported in the paper. Similarly, on the TGIF-QA dataset I obtained an accuracy of only 19.6% and a score of 2.4. For the other evaluation tasks, such as image VQA (e.g., VQAv2, GQA), my results matched those reported in the paper.

Could the authors provide feedback on the evaluation protocol for the video-QA tasks? Is there an alternative evaluation method that does not rely on GPT?


# Greedy decoding, as described above: sampling disabled and temperature pinned to 0
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=[video_tensor],   # single pre-processed video tensor
        do_sample=False,
        temperature=0.0,
        max_new_tokens=1024,
        use_cache=True,
        stopping_criteria=[stopping_criteria]
    )
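
For reference, the GPT-3.5 step above follows the Video-ChatGPT-style judging protocol: each prediction is sent to GPT-3.5 together with the question and the ground-truth answer, the judge returns a yes/no match plus a 0-5 score, and accuracy and the average score are computed from those. The sketch below is only my paraphrase of that step using the current openai>=1.0 client, not the repo's exact evaluation script; the prompt wording, the judge helper, and the qa_pairs variable are illustrative assumptions.

import ast
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge(question, answer, prediction, model="gpt-3.5-turbo"):
    # Ask the judge model whether the prediction matches the ground-truth answer
    # and to rate the match on a 0-5 scale (prompt wording is an approximation).
    response = client.chat.completions.create(
        model=model,
        temperature=0.0,
        messages=[
            {"role": "system",
             "content": ("You evaluate video question-answer pairs. Decide whether the "
                         "predicted answer matches the correct answer and give an integer "
                         "score from 0 to 5. Reply only with a Python dict like "
                         "{'pred': 'yes', 'score': 4}.")},
            {"role": "user",
             "content": (f"Question: {question}\n"
                         f"Correct Answer: {answer}\n"
                         f"Predicted Answer: {prediction}")},
        ],
    )
    return ast.literal_eval(response.choices[0].message.content)

# qa_pairs is a placeholder for (question, ground truth, prediction) triples
# loaded from the prediction file written by run_inference_video_qa.py.
results = [judge(q, a, p) for q, a, p in qa_pairs]
accuracy = sum(r["pred"] == "yes" for r in results) / len(results)
avg_score = sum(r["score"] for r in results) / len(results)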