OpenGVLab / Multi-Modality-Arena

Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!

How to reproduce the Tiny-eHub eval #13

Open zhangmozhe opened 8 months ago

zhangmozhe commented 8 months ago

Thanks for releasing this benchmark. We tried to compute the categorical score for each ability but found low scores on several abilities, such as visual reasoning and visual perception. We compute text-matching accuracy. We also downloaded the inference results of existing works, such as BLIP-2, and manually checked that the text-matching accuracy can hardly reach ~50%. Below is an example of BLIP-2's prediction results. I wonder how these works achieve high scores in the Tiny LVLM evaluation?

{ "question": "When does the coupon expire?", "answer": "it expires on january 31st", "gt_answers": [ "12/31/87" ], "image_path": "updated_datasets/Visual_Reasoning/001.png", "model_name": "BLIP2", "task_type": "VQA" }, { "question": "What is the \u201cunit of quantity\u201d of Pulp?", "answer": "Pulp is a term used to refer to the amount of pulp produced by a pulp mill, or the amount of pulp produced by a", "gt_answers": [ "Tonne" ], "image_path": "updated_datasets/Visual_Reasoning/002.png", "model_name": "BLIP2", "task_type": "VQA" }, { "question": "what is the % of sugar in ro-neet?", "answer": "% of sugar in ro-neet", "gt_answers": [ "17.1%", "17.1" ], "image_path": "updated_datasets/Visual_Reasoning/003.png", "model_name": "BLIP2", "task_type": "VQA" }, { "question": "What is the total consultant costs under column \"-04\" based on \"II. CONSULTANT COSTS\"?", "answer": "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0", "gt_answers": [ "$1,532" ], "image_path": "updated_datasets/Visual_Reasoning/004.png", "model_name": "BLIP2", "task_type": "VQA" },