MMStar-Benchmark / MMStar

This repo contains the evaluation code for the paper "Are We on the Right Way for Evaluating Large Vision-Language Models?"
https://mmstar-benchmark.github.io

Qwen-VL-Chat doesn't follow prompt #7

Open RifleZhang opened 4 months ago

RifleZhang commented 4 months ago

I followed the Qwen setup in https://github.com/open-compass/VLMEvalKit/tree/main.

Qwen-VL-Chat directly outputs the answer text instead of a letter choice. Did you use a customized prompt or do any post-processing of the model responses?
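
For context, by post-processing I mean a letter-extraction step like the sketch below. This is a minimal illustration under my own assumptions, not VLMEvalKit's actual matching code; the function name and the example option dictionary are made up for the example:

```python
import re
from typing import Optional

def extract_choice(response: str, options: dict) -> Optional[str]:
    """Map a free-form model response to an option letter.

    `options` maps letters to option texts, e.g. {"A": "a cat", "B": "a dog"}.
    Returns the matched letter, or None if the response is ambiguous.
    """
    # Case 1: the response already contains exactly one standalone option letter.
    letters = re.findall(r"\b([A-D])\b", response)
    if len(set(letters)) == 1:
        return letters[0]

    # Case 2: the response repeats the text of exactly one option.
    hits = [k for k, v in options.items() if v.lower() in response.lower()]
    if len(hits) == 1:
        return hits[0]

    return None  # ambiguous -> count as wrong, or fall back to an LLM judge

# Example: Qwen-VL-Chat answering with the option content instead of a letter.
print(extract_choice("The image shows a dog.", {"A": "a cat", "B": "a dog"}))  # B
```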

RifleZhang commented 4 months ago

Using the default prompt in https://github.com/open-compass/VLMEvalKit/tree/main, I got 22.26 for Qwen-VL-Chat. Using a different prompt or evaluation post-processing method can lead to large variance. Similarly, for Deepseek_vl_7b I got 26.86 with the LLaVA-NeXT prompt provided and 32.6 with the default prompt in VLMEvalKit.
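
To illustrate the kind of prompt difference I mean, here are two variants for the same multiple-choice item. The instruction lines are my own paraphrases, not the exact templates from the paper or from VLMEvalKit:

```python
# Hypothetical MMStar-style item, used only to show the two prompt styles.
question = "What is the main subject of the image?"
options = {"A": "a cat", "B": "a dog", "C": "a bird", "D": "a horse"}
option_block = "\n".join(f"{k}. {v}" for k, v in options.items())

# Variant 1: explicitly ask for a single letter (easy to score exactly).
prompt_letter = (
    f"Question: {question}\nOptions:\n{option_block}\n"
    "Answer with the option's letter from the given choices directly."
)

# Variant 2: no output-format instruction; chat models often reply with the
# option text instead of a letter, so the score then depends on post-processing.
prompt_free = f"Question: {question}\nOptions:\n{option_block}\nAnswer:"
```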

Is there an evaluation pipeline for the other models reported in the paper? I found it hard to replicate the exact numbers without the exact prompts.