BradyFU / Video-MME

✨✨Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis


[๐ŸŽ Project Page] [๐Ÿ“– arXiv Paper] [๐Ÿ“Š Dataset][๐Ÿ† Leaderboard]

Video-MME applies to both image MLLMs (by generalizing to multiple images) and video MLLMs. 🌟


🔥 News

👀 Video-MME Overview

In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point of recent advances, but their potential for processing sequential visual data remains insufficiently explored. We introduce Video-MME, the first-ever full-spectrum Multi-Modal Evaluation benchmark of MLLMs in Video analysis. It is designed to comprehensively assess the capabilities of MLLMs in processing video data, covering a wide range of visual domains, temporal durations, and data modalities. Video-MME comprises 900 videos totaling 254 hours, along with 2,700 human-annotated question-answer pairs. Our work distinguishes itself from existing benchmarks through four key features.

๐Ÿ“ Dataset Examples

๐Ÿ” Dataset

License:

Video-MME is only used for academic research. Commercial use in any form is prohibited.
The copyright of all videos belongs to the video owners.
If there is any infringement in Video-MME, please email videomme2024@gmail.com and we will remove it immediately.
Without prior approval, you cannot distribute, publish, copy, disseminate, or modify Video-MME in whole or in part. 
You must strictly comply with the above restrictions.

To request the dataset, please send an email to videomme2024@gmail.com. 🌟

🔮 Evaluation Pipeline

📍 Extract Frames and Subtitles:

There are a total of 900 videos and 744 subtitles, where all long videos have subtitles.

In the with-subtitles setting, use only the subtitles that correspond to the sampled video frames. For example, if you extract 10 frames per video for evaluation, take the 10 subtitles that correspond to the timestamps of those 10 frames.

If you have already prepared the video and subtitle files, you can refer to this script to extract the frames and corresponding subtitles.
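The frame-to-subtitle matching described above can be sketched as follows. This is a minimal illustration and not the referenced script: it assumes frames are sampled at the midpoints of equal-length segments, and that subtitles have already been parsed into sorted `(start_sec, end_sec, text)` tuples. The names `Cue`, `sample_timestamps`, `subtitle_at`, and `subtitles_for_frames` are hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Hypothetical representation of one parsed subtitle cue: (start_sec, end_sec, text).
Cue = Tuple[float, float, str]


def sample_timestamps(duration_sec: float, num_frames: int) -> List[float]:
    """Return the midpoints of num_frames equal-length segments of the video."""
    segment = duration_sec / num_frames
    return [segment * (i + 0.5) for i in range(num_frames)]


def subtitle_at(cues: List[Cue], t: float) -> Optional[str]:
    """Return the text of the cue whose [start, end] interval contains t, if any."""
    starts = [start for start, _, _ in cues]  # assumes cues are sorted by start time
    idx = bisect_right(starts, t) - 1         # last cue that starts at or before t
    if idx >= 0 and t <= cues[idx][1]:
        return cues[idx][2]
    return None


def subtitles_for_frames(cues: List[Cue], duration_sec: float, num_frames: int) -> List[str]:
    """Collect the subtitle text shown at each sampled frame, skipping silent frames."""
    texts = []
    for t in sample_timestamps(duration_sec, num_frames):
        text = subtitle_at(cues, t)
        if text is not None:
            texts.append(text)
    return texts
```

Frames that fall between cues contribute no subtitle here; whether to skip them or fall back to the nearest cue is a design choice the source does not specify.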

๐Ÿ“ Prompt:

The common prompt used in our evaluation follows this format:

This video's subtitles are listed below:
[Subtitles] 
Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option. 
[Question]
The best answer is:

For the subtitles-free setting, you should remove the subtitle content.

Prompt examples:

* With subtitles:

```
This video's subtitles are listed below:
Hi guys, I'm going to show you how to perfectly prepare a ...
Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.
What is the color of the clothing worn by the persons in the video?
A. Black.
B. Gray.
C. Green.
D. Brown.
The best answer is:
```

* Without subtitles:

```
Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.
What is the color of the clothing worn by the persons in the video?
A. Black.
B. Gray.
C. Green.
D. Brown.
The best answer is:
```
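For reference, the prompt format can be assembled programmatically. The sketch below is only an illustration of the template shown above; the function name `build_prompt` and its parameters are hypothetical and not part of the Video-MME code.

```python
from typing import List, Optional

# Instruction text copied from the prompt template in this README.
INSTRUCTION = (
    "Select the best answer to the following multiple-choice question based on the video. "
    "Respond with only the letter (A, B, C, or D) of the correct option."
)


def build_prompt(question: str, options: List[str], subtitles: Optional[str] = None) -> str:
    """Assemble the evaluation prompt; omit the subtitle block when subtitles is empty."""
    lines = []
    if subtitles:
        lines.append("This video's subtitles are listed below:")
        lines.append(subtitles)
    lines.append(INSTRUCTION)
    lines.append(question)
    lines.extend(options)  # options are expected to carry their letters, e.g. "A. Black."
    lines.append("The best answer is:")
    return "\n".join(lines)
```

Calling `build_prompt(question, options)` without the `subtitles` argument yields the subtitles-free variant.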

๐Ÿ“ Evaluation:

To extract the answer and calculate the scores, we add the model response to a JSON file. Here we provide an example template output_test_template.json. Once you have prepared the model responses in this format, please refer to the evaluation script eval_your_results.py, and you will get the accuracy scores across video_durations, video domains, video subcategories, and task types. The evaluation does not introduce any third-party models, such as ChatGPT.

python eval_your_results.py \
    --results_file $YOUR_RESULTS_FILE \
    --video_duration_type $VIDEO_DURATION_TYPE \
    --return_categories_accuracy \
    --return_sub_categories_accuracy \
    --return_task_types_accuracy

Please ensure that the results_file follows the specified JSON format stated above, and video_duration_type is specified as either short, medium, or long. If you wish to assess results across various duration types, you can specify multiple types separated by commas or organize them in a list, for example: short,medium,long or ["short","medium","long"].
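To make the scoring idea concrete, here is a rough sketch of letter extraction and per-duration accuracy. It is not eval_your_results.py: the flat record keys `duration`, `answer`, and `response` are assumptions made for illustration and do not reproduce the schema of output_test_template.json, and the function names are hypothetical.

```python
import json
import re
from collections import defaultdict
from typing import Dict, List


def parse_duration_types(value: str) -> List[str]:
    """Accept 'short,medium,long' or a JSON list such as '["short","medium"]'."""
    value = value.strip()
    if value.startswith("["):
        return [str(v) for v in json.loads(value)]
    return [v.strip() for v in value.split(",") if v.strip()]


def extract_choice(response: str) -> str:
    """Return the first standalone option letter (A-D) in the response, or ''."""
    match = re.search(r"\b([ABCD])\b", response.strip())
    return match.group(1) if match else ""


def accuracy_by_duration(records: List[dict], duration_types: List[str]) -> Dict[str, float]:
    """Compute accuracy per duration type over the selected duration types."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        duration = rec["duration"]  # assumed key, for illustration only
        if duration not in duration_types:
            continue
        total[duration] += 1
        if extract_choice(rec["response"]) == rec["answer"]:
            correct[duration] += 1
    return {d: correct[d] / total[d] for d in total}
```

The regular expression is deliberately simple and would misread a response that begins with the article "A"; a stricter parser may be needed for free-form model outputs.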

๐Ÿ“ Leaderboard:

If you want to add your model to our leaderboard, please send model responses to bradyfu24@gmail.com, as the format of output_test_template.json.

📈 Experimental Results

✒️ Citation

If you find our work helpful for your research, please consider citing it.

@article{fu2024video,
  title={Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis},
  author={Fu, Chaoyou and Dai, Yuhan and Luo, Yongdong and Li, Lei and Ren, Shuhuai and Zhang, Renrui and Wang, Zihan and Zhou, Chenyu and Shen, Yunhang and Zhang, Mengdan and others},
  journal={arXiv preprint arXiv:2405.21075},
  year={2024}
}

📜 Related Works

Explore our related research: