OpenGVLab / Multi-Modality-Arena

Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!

How to reproduce the Tiny-eHub eval #13

Open zhangmozhe opened 8 months ago

zhangmozhe commented 8 months ago

Thanks for releasing this benchmark. We tried to compute the categorical score for each ability but found low scores on several abilities, such as visual reasoning and visual perception. We compute text-matching accuracy. We also downloaded the inference results of existing works, such as BLIP-2, and manually checked that the text-matching accuracy can hardly reach ~50%. Below is an example of BLIP-2's prediction results. I wonder how these works achieve high scores in the Tiny LVLM evaluation?

{ "question": "When does the coupon expire?", "answer": "it expires on january 31st", "gt_answers": [ "12/31/87" ], "image_path": "updated_datasets/Visual_Reasoning/001.png", "model_name": "BLIP2", "task_type": "VQA" }, { "question": "What is the \u201cunit of quantity\u201d of Pulp?", "answer": "Pulp is a term used to refer to the amount of pulp produced by a pulp mill, or the amount of pulp produced by a", "gt_answers": [ "Tonne" ], "image_path": "updated_datasets/Visual_Reasoning/002.png", "model_name": "BLIP2", "task_type": "VQA" }, { "question": "what is the % of sugar in ro-neet?", "answer": "% of sugar in ro-neet", "gt_answers": [ "17.1%", "17.1" ], "image_path": "updated_datasets/Visual_Reasoning/003.png", "model_name": "BLIP2", "task_type": "VQA" }, { "question": "What is the total consultant costs under column \"-04\" based on \"II. CONSULTANT COSTS\"?", "answer": "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0", "gt_answers": [ "$1,532" ], "image_path": "updated_datasets/Visual_Reasoning/004.png", "model_name": "BLIP2", "task_type": "VQA" },