
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
https://mmiu-bench.github.io/

Best Practice

We strongly recommend using VLMEvalKit for its useful features and ready-to-use LVLM implementations.

MMIU

Quick Start | HomePage | arXiv | Dataset | Citation

This repository is the official implementation of MMIU.

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
Fanqing Meng*, Jin Wang*, Chuanhao Li*, Quanfeng Lu, Hao Tian, Jiaqi Liao, Xizhou Zhu, Jifeng Dai, Yu Qiao, Ping Luo, Kaipeng Zhang#, Wenqi Shao#
* MFQ, WJ, and LCH contributed equally.
# SWQ (shaowenqi@pjlab.org.cn) and ZKP (zhangkaipeng@pjlab.org.cn) are corresponding authors.

💡 News

Introduction

We introduce the Multimodal Multi-image Understanding (MMIU) benchmark, a comprehensive evaluation suite designed to assess LVLMs across a wide range of multi-image tasks. MMIU encompasses 7 types of multi-image relationships, 52 tasks, 77K images, and 11K meticulously curated multiple-choice questions, making it the most extensive benchmark of its kind.
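To make the question format concrete, the sketch below iterates over the multiple-choice annotations once downloaded. The file name and field names (task, input_image_path, question, options, output) are illustrative assumptions rather than the exact released schema; please check the Dataset link above for the actual format.

```python
import json

# NOTE: the file name and all field names below are assumptions for
# illustration; consult the released dataset for the actual schema.
with open("mmiu_annotations.json", "r") as f:
    samples = json.load(f)

sample = samples[0]
print("Task:", sample.get("task"))                # one of the 52 multi-image tasks
print("Images:", sample.get("input_image_path"))  # paths to the images for this question
print("Question:", sample.get("question"))
print("Options:", sample.get("options"))          # multiple-choice options (A/B/C/D)
print("Answer:", sample.get("output"))            # ground-truth option letter
```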

Evaluation Results Overview

🏆 Leaderboard

| Rank | Model | Score |
|------|-------|-------|
| 1 | GPT4o | 55.72 |
| 2 | Gemini | 53.41 |
| 3 | Claude3 | 53.38 |
| 4 | InternVL2 | 50.30 |
| 5 | Mantis | 45.58 |
| 6 | Gemini1.0 | 40.25 |
| 7 | internvl1.5-chat | 37.39 |
| 8 | Llava-interleave | 32.37 |
| 9 | idefics2_8b | 27.80 |
| 10 | glm-4v-9b | 27.02 |
| 11 | deepseek_vl_7b | 24.64 |
| 12 | XComposer2_1.8b | 23.46 |
| 13 | deepseek_vl_1.3b | 23.21 |
| 14 | flamingov2 | 22.26 |
| 15 | llava_next_vicuna_7b | 22.25 |
| 16 | XComposer2 | 21.91 |
| 17 | MiniCPM-Llama3-V-2_5 | 21.61 |
| 18 | llava_v1.5_7b | 19.19 |
| 19 | sharegpt4v_7b | 18.52 |
| 20 | sharecaptioner | 16.10 |
| 21 | qwen_chat | 15.92 |
| 22 | monkey-chat | 13.74 |
| 23 | idefics_9b_instruct | 12.84 |
| 24 | qwen_base | 5.16 |
| - | Frequency Guess | 31.5 |
| - | Random Guess | 27.4 |

🚀 Quick Start

Here, we mainly use the VLMEvalKit framework for testing, with some separate tests as well. The supported models are grouped by the transformers version they require. Specifically, for multi-image models, we include models grouped under:

transformers == 4.33.0

transformers == 4.37.0

transformers == 4.40.0

For single-image models, we include models grouped under:

transformers == 4.33.0

transformers == 4.37.0

transformers == 4.40.0
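Since the groups above require different transformers versions, it can help to verify the installed version before launching an evaluation. Below is a minimal sketch; the required version string is just an example, pick whichever version your model group needs.

```python
from importlib.metadata import version

# Example only: set this to the version required by your model group
# (4.33.0, 4.37.0, or 4.40.0 as listed above).
required = "4.37.0"
installed = version("transformers")

if installed != required:
    raise RuntimeError(
        f"transformers {installed} is installed, but this model group "
        f"expects transformers == {required}. Consider a separate environment."
    )
print(f"transformers {installed} matches the required version.")
```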

We use the VLMEvalKit framework for testing; you can refer to the code in VLMEvalKit/test_models.py. Additionally, for closed-source models, please replace the corresponding part of the code following this example:

response = model.generate(tmp) # tmp = image_paths + [question]

For other open-source models, we have provided reference code for Mantis and InternVL1.5-chat. For LLaVA-Interleave, please refer to the original repository. A rough illustration of the calling pattern is sketched below.
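The sketch instantiates a model from VLMEvalKit's registry and passes image paths plus the question text as a single list, mirroring the `model.generate(tmp)` call shown above. The registry key, image paths, and prompt formatting are placeholders for illustration; see VLMEvalKit/test_models.py for the actual evaluation code.

```python
from vlmeval.config import supported_VLM  # VLMEvalKit's model registry

# Placeholder model name; check supported_VLM.keys() for available entries.
model = supported_VLM["GPT4o"]()

# One MMIU sample: several images followed by the multiple-choice question.
image_paths = ["img_0.jpg", "img_1.jpg", "img_2.jpg"]
question = (
    "Question: ...\n"
    "Options:\nA. ...\nB. ...\nC. ...\nD. ...\n"
    "Answer with the option's letter from the given choices directly."
)

tmp = image_paths + [question]   # images first, then the text prompt
response = model.generate(tmp)   # same call as shown above
print(response)
```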

💐 Acknowledgement

We express our sincere gratitude to the following projects:

📧 Contact

If you have any questions, feel free to contact Fanqing Meng at mengfanqing33@gmail.com.

🖊️ Citation

If you find MMIU useful in your project or research, please use the following BibTeX entry to cite our paper. Thanks!

@article{meng2024mmiu,
  title={MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models},
  author={Meng, Fanqing and Wang, Jin and Li, Chuanhao and Lu, Quanfeng and Tian, Hao and Liao, Jiaqi and Zhu, Xizhou and Dai, Jifeng and Qiao, Yu and Luo, Ping and others},
  journal={arXiv preprint arXiv:2408.02718},
  year={2024}
}