OpenGVLab / LAMM

[NeurIPS 2023 Datasets and Benchmarks Track] LAMM: Multi-Modal Large Language Models and Applications as AI Agents
https://openlamm.github.io/

What are the metrics of `Omnibenchmark`, `ScienceQA`, `MMBench`, `SEED`, and `MME` benchmarks? #54

Closed zhimin-z closed 7 months ago

zhimin-z commented 7 months ago

(two screenshots attached)

Coach257 commented 7 months ago

We no longer use the LAMM benchmark for evaluating MLLMs. Instead, we have adopted our latest work, ChEF, as the evaluation benchmark. In this framework, scenarios such as Omnibenchmark, ScienceQA, MMBench, SEED, and MME use the PPL inferencer, and the metric is Accuracy. For more details, please refer to our ChEF paper; for usage of ChEF, please refer to the Tutorial. If you still wish to use the original LAMM evaluation method, we have also fully implemented the LAMM evaluation pipeline within the ChEF framework; please refer to the LAMM Recipes for details. Note, however, that the LAMM evaluation method is no longer recommended.
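
For readers unfamiliar with the term, a PPL (perplexity) inferencer scores each candidate answer of a multiple-choice sample by its likelihood under the model and picks the most likely one; Accuracy is then the fraction of samples where that pick matches the ground truth. The sketch below is a minimal, hypothetical illustration of that idea using a generic Hugging Face causal LM; it is not the ChEF implementation, and the model name and helper functions are assumptions for illustration only.

```python
# Hypothetical sketch (not ChEF code): scoring multiple-choice options by
# summed log-likelihood, which is equivalent to picking the lowest perplexity.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model; any causal LM works for the illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def option_logprob(prompt: str, option: str) -> float:
    """Sum of log-probabilities of the option tokens conditioned on the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # log P(token_t | tokens_<t); logits at position t-1 predict token t
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    option_start = prompt_ids.shape[1] - 1  # first predicted option token
    return token_lp[:, option_start:].sum().item()

def ppl_choice(question: str, options: list[str]) -> int:
    """Return the index of the option with the highest likelihood."""
    scores = [option_logprob(question, " " + opt) for opt in options]
    return max(range(len(options)), key=lambda i: scores[i])

# Accuracy over a dataset is then the fraction of samples where
# ppl_choice(question, options) equals the ground-truth answer index.
```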

zhimin-z commented 7 months ago

Thanks for your replies. Are the evaluation results currently on the LAMM website's leaderboard from ChEF? @Coach257

Coach257 commented 7 months ago

Yes, all the evaluation results on the leaderboard are from ChEF.