EvolvingLMMs-Lab / lmms-eval

Accelerating the development of large multimodal models (LMMs) with lmms-eval
https://lmms-lab.github.io/

add ConBench #100

Closed Gumpest closed 3 weeks ago

Gumpest commented 3 weeks ago

This PR adds ConBench as an additional benchmark focused on consistency.

When faced with prompts whose solution spaces differ in size, large vision-language models (LVLMs) do not always give consistent answers about the same knowledge point. This inconsistency across solution spaces is prevalent in LVLMs and erodes trust. To address this, we provide ConBench, a multimodal benchmark for intuitively analyzing how LVLMs perform when the solution space of a prompt revolves around a single knowledge point.

ConScore[D]

| Rank | Teacher | ConScore[D] |
|------|---------|-------------|
| 1 | Qwen-VL-Max | 37.00 |
| 2 | GPT-4-Omni | 35.70 |
| 3 | InternVL-v1.2P-40B | 34.70 |
| 4 | Gemini-Ultra-Vision | 33.10 |
| 5 | InternVL-v1.5-26B | 31.40 |
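
Once merged, the task should be runnable through the usual lmms-eval CLI. A minimal sketch, assuming the task is registered under the name `conbench` and using `llava` only as an example model (both names are illustrative, not confirmed by this PR):

```bash
# Run the ConBench task via lmms-eval (task name "conbench" is an assumption).
python3 -m accelerate.commands.launch \
    --num_processes=8 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b" \
    --tasks conbench \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```
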
Luodian commented 3 weeks ago

Thanks for this PR; it's pretty clear.