May I ask two questions:
1. What settings, or more specifically what prompt, do you use when testing your model on the GQA, POPE, and MME datasets?
2. Which part of the GQA data did you choose for evaluation?
PS:
I evaluated MobileVLM_V2-1.7B on the GQA testdev-balanced questions and got a score of about 53.1, which is lower than the roughly 59.3 reported in your paper.
Likewise, the average accuracy on POPE (adversarial, popular, and random) is 80.1 versus 84.3 in the paper.
The Acc. + Acc+ score on MME perception is 1128.1 (1302.8 in the paper), and 1376.6 across all MME tasks.
My prompt mode is "v1", and I append the sentence "Please answer that question with one word or phrase" to every question.
Temperature = 0.2.
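For reference, this is roughly how I format each question and set the decoding parameters in my own eval script; it is only a sketch, and the constant names are placeholders for illustration rather than anything from your codebase.

```python
# Sketch of my evaluation setup (illustrative only).
PROMPT_SUFFIX = "Please answer that question with one word or phrase"

def build_prompt(question: str) -> str:
    # conv mode "v1", with the short-answer instruction appended to every question
    return f"{question} {PROMPT_SUFFIX}"

# Decoding settings I pass to the model's generate call
GENERATION_KWARGS = {
    "do_sample": True,      # sampling enabled since temperature > 0
    "temperature": 0.2,
    "max_new_tokens": 128,  # assumed value; not specified in the paper
}
```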
Anyway, this is really great work and I would like to follow it. Looking forward to your reply.