EvolvingLMMs-Lab / lmms-eval

Accelerating the development of large multimodal models (LMMs) with lmms-eval
https://lmms-lab.github.io/

add ConBench #100

Closed Gumpest closed 3 weeks ago

Gumpest commented 3 weeks ago

This PR adds ConBench as an additional benchmark focused on consistency.

When faced with prompts whose solution spaces differ in size, large vision-language models (LVLMs) do not always give consistent answers about the same knowledge point. This inconsistency across solution spaces is prevalent in LVLMs and erodes trust. To address this, we provide ConBench, a multimodal benchmark for intuitively analyzing how LVLMs perform when the solution space of a prompt revolves around a single knowledge point.

ConScore[D]

| Rank | Teacher | ConScore[D] |
|------|---------|-------------|
| 1 | Qwen-VL-Max | 37.00 |
| 2 | GPT-4-Omni | 35.70 |
| 3 | InternVL-v1.2P-40B | 34.70 |
| 4 | Gemini-Ultra-Vision | 33.10 |
| 5 | InternVL-v1.5-26B | 31.40 |
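
Once merged, the task should be runnable through the usual lmms-eval CLI. A minimal sketch, assuming the task is registered under the name `conbench` and using `llava` only as an example model (both names are illustrative, not confirmed by this PR):

```bash
# Run the ConBench task via lmms-eval (task name "conbench" is an assumption).
python3 -m accelerate.commands.launch \
    --num_processes=8 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b" \
    --tasks conbench \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```
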
Luodian commented 3 weeks ago

Thanks for this PR; it's pretty clear.