OpenGVLab / Ask-Anything

[CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as miniGPT4, StableLM, and MOSS.
https://vchat.opengvlab.com/
MIT License

Evaluation results on MVBench different from the paper #90

Open emmating12 opened 6 months ago

emmating12 commented 6 months ago

Hi, I have tested the VideoChat2 model on my server and found that the results differ from the paper. My results are listed below:

{"Action Sequence": 66.0, "Action Prediction": 47.5, "Action Antonym": 83.5, "Fine-grained Action": 49.5, "Unexpected Action": 60.0, "Object Existence": 58.0, "Object Interaction": 71.5, "Object Shuffle": 41.5, "Moving Direction": 23.0, "Action Localization": 22.5, "Scene Transition": 88.5, "Action Count": 39.5, "Moving Count": 42.0, "Moving Attribute": 58.5, "State Change": 44.0, "Fine-grained Pose": 49.0, "Character Order": 36.5, "Egocentric Navigation": 35.0, "Episodic Reasoning": 38.5, "Counterfactual Inference": 65.0, "Avg": 50.975}

The results for OS, AL, AC, ER, and CI differ from the paper. Could you help me find the reason?
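
As a quick sanity check, the reported "Avg" is consistent with the unweighted mean of the 20 per-task accuracies. A minimal snippet to re-check it (values copied from the dict above):

```python
# Re-check that "Avg" is the unweighted mean of the 20 per-task accuracies.
results = {
    "Action Sequence": 66.0, "Action Prediction": 47.5, "Action Antonym": 83.5,
    "Fine-grained Action": 49.5, "Unexpected Action": 60.0, "Object Existence": 58.0,
    "Object Interaction": 71.5, "Object Shuffle": 41.5, "Moving Direction": 23.0,
    "Action Localization": 22.5, "Scene Transition": 88.5, "Action Count": 39.5,
    "Moving Count": 42.0, "Moving Attribute": 58.5, "State Change": 44.0,
    "Fine-grained Pose": 49.0, "Character Order": 36.5, "Egocentric Navigation": 35.0,
    "Episodic Reasoning": 38.5, "Counterfactual Inference": 65.0,
}
avg = sum(results.values()) / len(results)
print(f"Avg = {avg:.3f}")  # 50.975, matching the reported average
```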

Andy1621 commented 6 months ago

Hi! Could you provide your environment list, like torch and CUDA version?
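
A small sketch like the following prints the relevant versions (it only uses standard torch/torchvision attributes, nothing repo-specific):

```python
# Quick sketch for reporting the evaluation environment.
import platform
import torch
import torchvision

print("Python     :", platform.python_version())
print("PyTorch    :", torch.__version__)           # e.g. 1.13.1+cu117
print("torchvision:", torchvision.__version__)     # e.g. 0.14.1+cu117
print("CUDA (torch build):", torch.version.cuda)   # e.g. 11.7
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))   # e.g. A100
```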

emmating12 commented 6 months ago

Hi, python=3.10.13, torch=1.13.1+cu117, torchvision=0.14.1+cu117, cuda=11.7.

Andy1621 commented 6 months ago

For me, the code is run on an A100 with:

Python=3.7.12
cuda=11.7
torch=1.13.1+cu117
torchvision=0.14.1+cu117
emmating12 commented 5 months ago

> For me, the code is run on an A100 with:
>
> Python=3.7.12
> cuda=11.7
> torch=1.13.1+cu117
> torchvision=0.14.1+cu117

Hi, I have tested the VideoChat2 model on an A100 with python=3.8, torch=1.13.1+cu117, torchvision=0.14.1+cu117, and cuda=11.7. The result for "Episodic Reasoning" is 38.5%, which differs from the paper; the other results are the same. Could you help me find the reason?

Andy1621 commented 5 months ago

Hi! I think the reason is that you are using an old version of the inference code. In the new version, I set the flag to True to use the temporal boundary, which slightly improves the results.

[screenshot: the updated inference code with the temporal-boundary flag set to True]
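
For context on what "setting True" changes: as described here and in the later comments, the flag makes the loader sample frames only inside the annotated start/end segment of the TVQA clip (Episodic Reasoning) instead of across the whole video. A minimal sketch of that idea; the function name and arguments are illustrative, not the exact code in mvbench.ipynb:

```python
import numpy as np

def sample_frame_indices(num_segments, fps, max_frame, bound=None, first_idx=0):
    """Uniformly sample `num_segments` frame indices.
    If bound=(start_sec, end_sec) is given, sample only inside that temporal
    segment -- the "setting True" case for Episodic Reasoning (TVQA)."""
    start_sec, end_sec = bound if bound is not None else (0.0, (max_frame + 1) / fps)
    start_idx = max(first_idx, round(start_sec * fps))
    end_idx = min(round(end_sec * fps), max_frame)
    seg_size = (end_idx - start_idx) / num_segments
    return np.array([
        int(start_idx + seg_size / 2 + np.round(seg_size * i))
        for i in range(num_segments)
    ])
```

With the bound applied, the model only sees frames from the segment the question is annotated for, which is presumably why Episodic Reasoning moves from 38.5% to 40.5%.
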
chenxshuo commented 3 months ago

@emmating12 Hi, we have the same reproduction results. Did you find a way to reproduce the performance on Episodic Reasoning?

@Andy1621 Thanks for the info. I ran mvbench.ipynb with the flag set to True for Episodic Reasoning, but the performance is still 38.5% instead of 40.5%. Do you have any other suggestions?

Andy1621 commented 3 months ago

Hi! I'm not sure whether you have run the model inference correctly.

Originally, when I tested MVBench, I forgot to use start and end for TVQA, and thus got 38.5%, the same as you.

[screenshot: Episodic Reasoning result of 38.5% without start/end]

But when I fixed the bug and used start and end (setting the flag to True), the result increased as expected to 40.5%.

[screenshot: Episodic Reasoning result of 40.5% with start/end]
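
To make the effect of the start/end bound concrete, here is a toy, self-contained example with made-up numbers (a 60-second clip at 3 fps, question annotated for 20.0-28.0 s), using the same sampling arithmetic as the sketch above:

```python
# Toy numbers: a 60-second TVQA clip at 3 fps (180 frames), question annotated for 20.0-28.0 s.
fps, max_frame, num_segments = 3, 179, 8
for bound in (None, (20.0, 28.0)):
    start_sec, end_sec = bound if bound else (0.0, (max_frame + 1) / fps)
    start_idx, end_idx = round(start_sec * fps), min(round(end_sec * fps), max_frame)
    seg = (end_idx - start_idx) / num_segments
    idx = [int(start_idx + seg / 2 + round(seg * i)) for i in range(num_segments)]
    print(bound, idx)
# Without the bound, the 8 sampled frames span the whole 180-frame clip;
# with it, they all fall between frames 60 and 84, i.e. the annotated segment.
```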