Closed JinhuiYE closed 2 months ago
Hi, thanks for the amazing benchmark.
I have some questions regarding the https://github.com/thanku-all/parse_answer/blob/main/eval_your_results.py script.
In the script, you check whether each duration has 300 questions:
assert len(your_results_video_type) == 300, f"Number of files in {video_type} is not 300. Check if there are missing files."
However, in the dataset "https://huggingface.co/datasets/lmms-lab/Video-MME", there are 2700 questions—900 for each duration (i.e., short, medium, long).
How can I ensure consistency between my results and the reported results?
Oh, I got it. This is because the index structures of these two files are different.
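For anyone hitting the same assertion, here is a minimal sketch of how the two counts can line up. It assumes (my reading, not confirmed by the script's authors) that the evaluation file has one entry per video while the Hugging Face dataset has one row per question, so 300 videos with 3 questions each give 900 questions per duration. The rows below are synthetic stand-ins, and the `video_id` and `question_id` field names and the `group_by_video` helper are hypothetical.

```python
from collections import defaultdict


def group_by_video(rows):
    """Group flat per-question rows into {duration: {video_id: [rows]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        grouped[row["duration"]][row["video_id"]].append(row)
    return grouped


# Synthetic stand-in for a flat question list (field names are assumptions):
# 3 durations x 300 videos x 3 questions = 2700 rows.
rows = [
    {"duration": d, "video_id": f"{d}-{v:03d}", "question_id": f"{d}-{v:03d}-{q}"}
    for d in ("short", "medium", "long")
    for v in range(300)
    for q in range(3)
]

grouped = group_by_video(rows)
for duration, videos in grouped.items():
    n_questions = sum(len(questions) for questions in videos.values())
    # Per-video count is what a "== 300" check would see; per-question count is 900.
    print(duration, len(videos), n_questions)
```

Counting the groups gives the per-video total and summing the group sizes gives the per-question total, so both numbers can be checked before running the evaluation script.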