jayten-jeon opened this issue 10 months ago
I speculate that the changes may be due to the GPT version, but it shouldn't have changed so drastically. I would suggest you first try the non-GPT evaluations, e.g. ScienceQA, TextVQA, GQA, to rule out model configuration issues.
I also found that on the TGIF_Zero_Shot_QA task I get accuracy: 0.50, score: 3.2013, while the paper reports accuracy: 70.0, score: 4.0. I evaluated the model on TextVQA and got 51.8, which is exactly the number reported in the paper, so the model and configuration should be fine. I believe the update of gpt-3.5-turbo has a big influence on the GPT-assisted evaluation, so the result is not stable, making it hard to reproduce.
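One thing that might help rule out judge drift is pinning the judge to a dated model snapshot and zeroing the sampling temperature when calling the GPT-assisted evaluation. Here is a minimal sketch, assuming the openai>=1.0 Python client; the snapshot name, prompt, and `judge` function are illustrative stand-ins, not the repo's actual eval script:

```python
# Minimal sketch: pin the judge to a dated snapshot and use temperature=0 so
# repeated runs are as comparable as possible. Assumes the openai>=1.0 Python
# client; the prompt and snapshot name are illustrative, not the repo's exact
# eval script (and dated snapshots are eventually retired by OpenAI).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_MODEL = "gpt-3.5-turbo-0613"  # a dated snapshot, not the floating "gpt-3.5-turbo" alias

def judge(question: str, answer: str, prediction: str) -> str:
    """Ask the pinned judge model to grade one video-QA prediction."""
    response = client.chat.completions.create(
        model=JUDGE_MODEL,
        temperature=0,  # reduce run-to-run variance
        messages=[
            {
                "role": "system",
                "content": "You evaluate video question answering. Reply with "
                           "'yes' or 'no' for correctness and an integer score from 0 to 5.",
            },
            {
                "role": "user",
                "content": f"Question: {question}\n"
                           f"Correct answer: {answer}\n"
                           f"Predicted answer: {prediction}",
            },
        ],
    )
    return response.choices[0].message.content
```

Even with a pinned snapshot the judge is not fully deterministic, but at least the model behind the evaluation stops changing underneath you.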
Hi, everyone. Based on the feedback so far, https://github.com/PKU-YuanGroup/Video-LLaVA/issues/37#issue-2032217679 succeeded on MSVD, MSRVTT, and ActivityNet but failed on TGIF, and https://github.com/PKU-YuanGroup/Video-LLaVA/issues/36#issue-2031834153 failed on MSVD.
I have also observed similar problems in other work: https://github.com/mbzuai-oryx/Video-ChatGPT/issues/28
Maybe we should find a more stable, non-GPT evaluation method.
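For datasets with short ground-truth answers (MSVD-QA, MSRVTT-QA, TGIF-QA), a deterministic string-match accuracy would at least give a number that does not move between runs. A minimal sketch, assuming a JSONL predictions file with hypothetical "answer" and "pred" fields (the file layout is an assumption, not the repo's actual output format):

```python
# Minimal sketch of a deterministic, non-GPT metric: normalized substring-match
# accuracy over a predictions file. The file name and the "answer"/"pred" field
# names are assumptions for illustration, not the repo's actual output format.
import json
import re

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower().strip())
    return re.sub(r"\s+", " ", text)

def string_match_accuracy(path: str) -> float:
    correct, total = 0, 0
    with open(path) as f:
        for line in f:
            item = json.loads(line)
            gt = normalize(item["answer"])
            pred = normalize(item["pred"])
            # Count a hit if the ground-truth answer appears in the prediction.
            correct += int(gt in pred)
            total += 1
    return correct / max(total, 1)

if __name__ == "__main__":
    print(f"accuracy: {string_match_accuracy('msvd_qa_preds.jsonl'):.4f}")
```

It is stricter than a GPT judge (no credit for paraphrases), but it is reproducible and free to run.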
I have tested on GQA and POPE and got the same results, and tested on MM-Bench and got similar results (60.9 in the paper vs. 61.5 in my run). However, testing LLaVA-Bench after the GPT version was updated from gpt-4-0314 to gpt-4-0613 returned the following scores:
| LLaVA-Bench split | Relative score | GPT-4 score | Video-LLaVA-7B score |
|---|---|---|---|
| all | 55.7 | 91.2 | 50.8 |
| llava_bench_complex | 71.3 | 87.9 | 62.7 |
| llava_bench_conv | 45.1 | 95.3 | 42.9 |
| llava_bench_detail | 40.3 | 92.7 | 37.3 |
So I believe it is an issue with the GPT-based evaluation method. Using an open-source judge model instead of GPT-3.5 or GPT-4 would allow for reproducibility.
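For what it's worth, the first column above looks like the model score expressed as a percentage of the GPT-4 reference score. A quick check below; which absolute score is the GPT-4 reference and which is the model is my assumption, not taken from the repo's summarizing script:

```python
# Quick check of how the three numbers per row appear to relate: the first
# column looks like 100 * model_score / gpt4_score. The pairing of the two
# absolute scores is my assumption; the small mismatches (45.0 vs 45.1,
# 40.2 vs 40.3) are presumably rounding of the underlying per-sample scores.
rows = {
    "all":                 (91.2, 50.8),
    "llava_bench_complex": (87.9, 62.7),
    "llava_bench_conv":    (95.3, 42.9),
    "llava_bench_detail":  (92.7, 37.3),
}

for name, (gpt4_score, model_score) in rows.items():
    relative = 100.0 * model_score / gpt4_score
    print(f"{name}: relative = {relative:.1f}")  # 55.7, 71.3, 45.0, 40.2
```

If that reading is right, the drop comes from the judge scoring both sides differently after the version change, not from the model's answers themselves.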
I tried running the evaluation code on your released model checkpoint, but I cannot reproduce the results reported in your paper.
Can you help me with this?
Below are the results that I got.
Below are the results from the paper.