Open · yellow-binary-tree opened this issue 1 month ago
Thank you for your attention to our work. For Qwen-VL, the official code evaluates with the PPL method over the A/B/C/D options. For LLaVA-1.5, the official code evaluates with the generate method, so I modified the corresponding code to use the PPL evaluation method instead. For InternLM_Xcomposer_VL, the official evaluation code already uses the PPL method, so I provide the InternLM_Xcomposer_VL evaluation code based directly on theirs.
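To make the distinction concrete, here is a minimal, hypothetical sketch of the two evaluation styles mentioned above (function names and the loss dictionary are illustrative, not the repo's actual API): generate-style evaluation parses an option letter out of free-form model output, while PPL-style evaluation picks the option with the lowest loss.

```python
import re

def eval_by_generate(generated_text: str, answer_letter: str) -> bool:
    """Generate-style evaluation (as described for LLaVA-1.5): let the model
    free-generate, then parse the first standalone option letter from the output."""
    m = re.search(r"\b([ABCD])\b", generated_text)
    return m is not None and m.group(1) == answer_letter

def eval_by_ppl(choice_losses: dict[str, float], answer_letter: str) -> bool:
    """PPL-style evaluation (as described for Qwen-VL): given a loss computed
    for each option, predict the option whose loss is lowest."""
    pred = min(choice_losses, key=choice_losses.get)
    return pred == answer_letter
```

Note that the two styles can disagree on the same model: generate-style depends on instruction-following and answer parsing, while PPL-style only compares likelihoods.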
In `SEED-Bench-2/model/InternLM_Xcomposer_VL_interface.py`, for the InternLM_Xcomposer_VL model, all choices are appended to the model input and the choice letters ("A.", "B.", "C.", "D.") are used as the labels for computing the loss. For all the other models (instructblip, qwen_vl, llava_v2), the interface code shows that only the question is added to the model input, and the full text of each choice is used as the labels, with the loss computed for each choice independently. Why do you use different input formats for different models? Could this have a large impact on accuracy?