AILab-CVC / SEED-Bench

(CVPR 2024) A benchmark for evaluating Multimodal LLMs using multiple-choice questions.

Question about evaluation input format #27

Open yellow-binary-tree opened 1 month ago

yellow-binary-tree commented 1 month ago

In SEED-Bench-2/model/InternLM_Xcomposer_VL_interface.py, for the InternLM_Xcomposer_VL model, all choices are added to the model input, and the choice letters ("A.", "B.", "C.", "D.") are used as the labels for computing the loss. For all other models (instructblip, qwen_vl, llava_v2), the interface code adds only the question to the model input, and the text of each choice is scored independently as the label for computing the loss.
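To make the two formats concrete, here is a minimal, text-only sketch of the two likelihood-scoring schemes, assuming a HuggingFace-style causal LM. The function names (`continuation_loss`, `predict_letter_style`, `predict_text_style`) are hypothetical and not taken from the SEED-Bench interface code, and image inputs are omitted for brevity:

```python
# Hypothetical sketch of the two PPL-style scoring schemes; not the
# actual SEED-Bench interface code. Image inputs are omitted.
import torch
import torch.nn.functional as F

@torch.no_grad()
def continuation_loss(model, tokenizer, prompt: str, continuation: str) -> float:
    """Average cross-entropy of `continuation` tokens given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    cont_ids = tokenizer(continuation, add_special_tokens=False,
                         return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, cont_ids], dim=1)
    logits = model(input_ids).logits
    # Keep only the positions that predict the continuation tokens
    # (shifted by one for next-token prediction).
    cont_logits = logits[:, prompt_ids.shape[1] - 1 : -1, :]
    return F.cross_entropy(cont_logits.reshape(-1, cont_logits.size(-1)),
                           cont_ids.reshape(-1)).item()

def predict_letter_style(model, tokenizer, question, choices):
    # InternLM_Xcomposer_VL style: all choices appear in the prompt,
    # and only the choice letter is scored as the label.
    letters = ["A", "B", "C", "D"][: len(choices)]
    prompt = question + "\n" + "\n".join(
        f"{l}. {c}" for l, c in zip(letters, choices)) + "\nAnswer:"
    losses = [continuation_loss(model, tokenizer, prompt, f" {l}")
              for l in letters]
    return letters[losses.index(min(losses))]

def predict_text_style(model, tokenizer, question, choices):
    # instructblip / qwen_vl / llava_v2 style in this repo: only the
    # question is in the prompt, and each full choice text is scored
    # independently; the lowest-loss choice wins.
    losses = [continuation_loss(model, tokenizer, question, " " + c)
              for c in choices]
    return "ABCD"[losses.index(min(losses))]
```

In both schemes the predicted answer is the option with the lowest average per-token loss; the difference is whether the model sees all options in the prompt and scores a letter, or sees only the question and scores each option's full text.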

I wonder why you use different input formats for different models. Will this have a large impact on accuracy?

Bohao-Lee commented 1 month ago

Thank you for your attention to our work. For qwen_vl, the official code evaluates with a PPL method over the A/B/C/D letters. For llava 1.5, the official code evaluates with the generate method, so I modified the corresponding code to use the PPL evaluation method. For InternLM_Xcomposer_VL, however, the official evaluation code already uses a PPL method, so I simply provided the InternLM_Xcomposer_VL evaluation code based on their original code.