PKU-YuanGroup / Video-LLaVA

【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
https://arxiv.org/pdf/2311.10122.pdf
Apache License 2.0

Cannot reproduce LLaVA-Bench. #55

Open jayten-jeon opened 10 months ago

jayten-jeon commented 10 months ago

I tried running the evaluation code on your model checkpoint, but I cannot reproduce the results noted in your paper.

Can you help me with this?

Below are the results that I got.

Video-LLaVA-7B
all 56.2 86.2 48.4
llava_bench_complex 71.8 85.4 61.2
llava_bench_conv 45.6 85.9 39.1
llava_bench_detail 39.8 88.0 35.0

Below are the results from the paper.

(screenshot of the paper's LLaVA-Bench results table)

LinB203 commented 10 months ago

I suspect the differences come from the GPT version, although they shouldn't shift this drastically. I would suggest running the non-GPT evaluations first, e.g. ScienceQA, TextVQA, GQA, to rule out a model-configuration issue.

Leo-Yuyang commented 10 months ago

I also found that on the TGIF_Zero_Shot_QA task I get accuracy 0.50 and score 3.2013, while the paper reports accuracy 70.0 and score 4.0. I evaluated the model on TextVQA and got 51.8, which is exactly the number reported in the paper, so the model and configuration are fine. I believe the updates to gpt-3.5-turbo have a big influence on the GPT-assisted evaluation, so the result is unstable and hard to reproduce.
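
For reference, the GPT-assisted video-QA protocol used here (following Video-ChatGPT) asks a GPT judge for a yes/no verdict and a 0-5 score per prediction, then averages them into accuracy and score. Below is a minimal sketch, assuming the newer `openai` Python client; the prompt wording and the `pairs` input format are illustrative, not the repo's exact script. It also shows the most direct mitigation for the drift discussed above: pinning a dated snapshot such as `gpt-3.5-turbo-0613` with `temperature=0` instead of the moving `gpt-3.5-turbo` alias.

```python
# Hedged sketch of a Video-ChatGPT-style GPT-assisted QA evaluation.
# Prompt text and input format are illustrative, not the repo's exact script.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def judge(question: str, answer: str, prediction: str) -> dict:
    """Ask the GPT judge for a yes/no correctness verdict and a 0-5 score."""
    response = client.chat.completions.create(
        # Pin a dated snapshot instead of the moving "gpt-3.5-turbo" alias;
        # otherwise silent model updates shift accuracy/score over time.
        model="gpt-3.5-turbo-0613",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You evaluate video question-answering predictions. "
                        'Reply with a JSON object: {"pred": "yes" or "no", '
                        '"score": integer 0-5}.'},
            {"role": "user",
             "content": f"Question: {question}\n"
                        f"Correct Answer: {answer}\n"
                        f"Predicted Answer: {prediction}"},
        ],
    )
    return json.loads(response.choices[0].message.content)

def evaluate(pairs):
    """pairs: iterable of (question, answer, prediction) tuples."""
    verdicts = [judge(q, a, p) for q, a, p in pairs]
    accuracy = sum(v["pred"] == "yes" for v in verdicts) / len(verdicts)
    avg_score = sum(v["score"] for v in verdicts) / len(verdicts)
    return accuracy, avg_score
```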

LinB203 commented 9 months ago

Hi, everyone. In the feedback so far, https://github.com/PKU-YuanGroup/Video-LLaVA/issues/37#issue-2032217679 reproduced MSVD, MSRVTT, and ActivityNet but failed on TGIF, while https://github.com/PKU-YuanGroup/Video-LLaVA/issues/36#issue-2031834153 failed on MSVD.

I have also observed similar problems in other work: https://github.com/mbzuai-oryx/Video-ChatGPT/issues/28

Maybe we should find a more stable, non-GPT evaluation method; one possibility is sketched below.
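
One judge-free option, as a hedged sketch assuming short ground-truth answers (as in the MSVD/MSRVTT/TGIF QA sets): a normalized exact/substring match. It is stricter than the GPT judge, so its absolute numbers are not comparable to the paper's GPT-assisted accuracy, but it is fully deterministic and reproducible across runs.

```python
# Hedged sketch of a judge-free accuracy metric for short-answer video QA.
# Stricter than a GPT judge, but deterministic and reproducible.
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def match(prediction: str, answer: str) -> bool:
    """Count a hit if the normalized answer appears in the prediction."""
    return normalize(answer) in normalize(prediction)

def accuracy(pairs) -> float:
    """pairs: iterable of (prediction, ground_truth) tuples."""
    hits = [match(p, a) for p, a in pairs]
    return sum(hits) / len(hits)
```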

NicholasCG commented 6 months ago

I have tested on GQA and POPE and reproduced the reported results, and tested on MM-Bench with similar results (60.9 in the paper vs. 61.5 in mine). However, testing LLaVA-Bench after updating the GPT judge from gpt-4-0314 to gpt-4-0613 gave me the following scores:

Video-LLaVA-7B
all 55.7 91.2 50.8
llava_bench_complex 71.3 87.9 62.7
llava_bench_conv 45.1 95.3 42.9
llava_bench_detail 40.3 92.7 37.3

so I believe the issue lies with the GPT-based evaluation method. Using an open-source judge model instead of GPT-3.5 or GPT-4 would make the results reproducible.
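
As an illustration of that idea, here is a hedged sketch that swaps the GPT judge for a local open-weight model served through Hugging Face transformers. The checkpoint name and prompt are placeholders, and a 7B judge will not grade as reliably as GPT-4, but with a fixed checkpoint and greedy decoding the verdicts become reproducible.

```python
# Hedged sketch: replace the GPT judge with a local open-weight model so the
# evaluation is reproducible. Model name and prompt are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder judge checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

def judge(question: str, answer: str, prediction: str) -> str:
    """Return 'yes' or 'no' from the local judge for one QA prediction."""
    prompt = (
        "Grade the predicted answer against the correct answer.\n"
        f"Question: {question}\nCorrect Answer: {answer}\n"
        f"Predicted Answer: {prediction}\n"
        "Reply with exactly one word: yes or no."
    )
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        return_tensors="pt", add_generation_prompt=True,
    ).to(model.device)
    # Greedy decoding keeps the verdict deterministic for a fixed checkpoint.
    output = model.generate(inputs, max_new_tokens=5, do_sample=False)
    reply = tokenizer.decode(output[0, inputs.shape[-1]:],
                             skip_special_tokens=True)
    return reply.strip().lower()
```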