Open Leo-Yuyang opened 1 month ago
Can you reproduce the results of other models? Recently, one paper revealed that the GPT version affects the final results.
I have checked the history and we use gpt-3.5-turbo by default. Considering the testing time, we may use gpt-3.5-turbo-1106 and the results are as follows:
completed_files: 1996
incomplete_files: 0
All evaluation completed!
Average score for correctness: 3.020541082164329
completed_files: 1996
incomplete_files: 0
All evaluation completed!
Average score for detailed orientation: 2.875250501002004
completed_files: 1996
incomplete_files: 0
All evaluation completed!
Average score for contextual understanding: 3.509018036072144
completed_files: 499
incomplete_files: 0
All evaluation completed!
Average score temporal understanding: 2.661322645290581
completed_files: 499
incomplete_files: 0
All evaluation completed!
Average score for consistency: 2.8076152304609217
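For a side-by-side comparison with the numbers in the paper, the per-metric averages can be pulled out of a log like the one above with a few lines of Python. This is only a sketch: `parse_scores` and `SCORE_LINE` are hypothetical names, not part of the repository's evaluation code, and the pattern assumes the exact log format shown above (including the one line that omits "for").

```python
import re

# Hypothetical pattern (an assumption, not repository code): matches lines like
# "Average score for correctness: 3.02" and "Average score temporal understanding: 2.66".
SCORE_LINE = re.compile(
    r"^Average score (?:for )?(?P<metric>.+?):\s*(?P<value>\d+(?:\.\d+)?)\s*$"
)


def parse_scores(log_text):
    """Return {metric name: average score} for every score line in the log."""
    scores = {}
    for line in log_text.splitlines():
        match = SCORE_LINE.match(line.strip())
        if match:
            scores[match.group("metric")] = float(match.group("value"))
    return scores


# Score lines copied from the log above; the other log lines are simply ignored.
LOG = """\
completed_files: 1996
incomplete_files: 0
All evaluation completed!
Average score for correctness: 3.020541082164329
Average score for detailed orientation: 2.875250501002004
Average score for contextual understanding: 3.509018036072144
Average score temporal understanding: 2.661322645290581
Average score for consistency: 2.8076152304609217
"""

scores = parse_scores(LOG)
for metric, value in scores.items():
    print(f"{metric}: {value:.2f}")
```

Rounded to two decimals, this prints 3.02, 2.88, 3.51, 2.66, and 2.81 for the five metrics, in the order they appear in the log.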
Dear author, in your paper you claim very impressive performance on the VideoChatGPT video benchmark. However, I couldn't find any code for reproducing this experiment, so I modified the MVBench evaluation code to run inference on this task. I can't reproduce the reported results.
What I got is:
The model is the right one: I used the same model to reproduce the reported performance on MVBench.
So the only remaining difference might be the prompt used when testing on this dataset. However, I find it hard to believe that a different prompt alone can lead to such a big difference.
Could you please share the prompt used when testing on this benchmark, or the inference results? Extraordinary claims require extraordinary evidence.