Cannot reproduce videochatgpt video benchmark

Leo-Yuyang commented 1 month ago

Dear author, in your paper you report very impressive performance on the VideoChatGPT video benchmark. However, I couldn't find any code for reproducing this experiment, so I modified the MVBench evaluation code to run inference on this task. I still can't reproduce the reported results.

What I got is:

The model itself should be correct, because I used the same checkpoint to reproduce the reported performance on MVBench. So the only remaining difference is probably the prompt used when testing on this dataset. However, I find it hard to believe that a different prompt alone could lead to such a big gap. Could you please share the prompt you used for this benchmark, or your inference results? Extraordinary claims require extraordinary evidence.

Andy1621 commented 1 month ago

Can you reproduce the results of other models? A recent paper showed that the GPT version used for scoring affects the final results.

Andy1621 commented 1 month ago

I have checked the history, and we use gpt-3.5-turbo by default. Given when we ran the tests, the version in use may have been gpt-3.5-turbo-1106. The results with that version are as follows:

completed_files: 1996
incomplete_files: 0
All evaluation completed!
Average score for correctness: 3.020541082164329
completed_files: 1996
incomplete_files: 0
All evaluation completed!
Average score for detailed orientation: 2.875250501002004
completed_files: 1996
incomplete_files: 0
All evaluation completed!
Average score for contextual understanding: 3.509018036072144
completed_files: 499
incomplete_files: 0
All evaluation completed!
Average score temporal understanding: 2.661322645290581
completed_files: 499
incomplete_files: 0
All evaluation completed!
Average score for consistency: 2.8076152304609217
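
For anyone comparing their own run against the log above, here is a minimal sketch that pulls the five averages out of the evaluation output so they can be lined up side by side. It is not part of the Ask-Anything or Video-ChatGPT code; the names `SCORE_LINE`, `parse_scores`, and `LOG` are made up for illustration, and the sketch assumes the log lines look exactly like the ones posted above (including the temporal line, which has no "for").

```python
import re

# Matches "Average score for correctness: 3.02..." and also
# "Average score temporal understanding: 2.66..." (no "for" in that line).
SCORE_LINE = re.compile(
    r"^Average score (?:for )?(?P<metric>.+?):\s*(?P<score>\d+(?:\.\d+)?)\s*$"
)


def parse_scores(log_text):
    """Return {metric name: score} for every 'Average score' line in the log."""
    scores = {}
    for line in log_text.splitlines():
        match = SCORE_LINE.match(line.strip())
        if match:
            scores[match.group("metric")] = float(match.group("score"))
    return scores


# Log excerpt copied from the comment above (hypothetical variable name).
LOG = """\
completed_files: 1996
incomplete_files: 0
All evaluation completed!
Average score for correctness: 3.020541082164329
Average score for detailed orientation: 2.875250501002004
Average score for contextual understanding: 3.509018036072144
Average score temporal understanding: 2.661322645290581
Average score for consistency: 2.8076152304609217
"""

scores = parse_scores(LOG)
for metric, score in scores.items():
    # Two decimals, the precision typically used when reporting these scores.
    print(f"{metric}: {score:.2f}")
```

Running the same function on your own log and on the one above makes it easy to see which of the five dimensions diverge, which may help narrow down whether the prompt or the GPT version is responsible.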

OpenGVLab / Ask-Anything, issue #175