OpenGVLab / LAMM

[NeurIPS 2023 Datasets and Benchmarks Track] LAMM: Multi-Modal Large Language Models and Applications as AI Agents
https://openlamm.github.io/

What does `failed` mean in the test? #56

Closed zhimin-z closed 10 months ago

zhimin-z commented 10 months ago

image

Coach257 commented 10 months ago

`Failed` means the MLLM could not perform the corresponding task, i.e. its evaluation results are far below expectations. For example, in the keypoint detection task, a result is labeled `Failed` if none of the keypoints in the MLLM's response are correct. In the facial classification task CelebA(Smile), the answer is only 'yes' or 'no', so if the accuracy is below 50%, which is the accuracy of random guessing, we also consider it a failure.
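As a rough illustration of the rule described above, here is a minimal sketch. It is not the actual LAMM evaluation code: the function name `label_result` and the parameter `chance_level` are hypothetical, and the two-decimal formatting is an assumption.

```python
def label_result(score: float, chance_level: float = 0.0) -> str:
    """Return "Failed" for a degenerate result, otherwise the formatted score.

    Hypothetical helper, not part of the LAMM code base.
    score        -- metric in [0, 1], e.g. accuracy or fraction of correct keypoints
    chance_level -- score expected from random guessing (0.5 for a yes/no task)
    """
    # Nothing correct at all (the keypoint detection example), or
    # worse than random guessing (the CelebA(Smile) example).
    if score == 0 or score < chance_level:
        return "Failed"
    return f"{score:.2f}"


# Keypoint detection: no keypoint is correct.
print(label_result(0.0))
# Binary yes/no classification: accuracy below the 50% chance level.
print(label_result(0.45, chance_level=0.5))
# Binary classification at or above chance keeps its numeric value.
print(label_result(0.62, chance_level=0.5))
```

Under these assumptions, only results that are empty or below chance are replaced by the label; every other score is shown as a number.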

zhimin-z commented 10 months ago

Thanks for your explanation. But I still think it would be better to show the exact values rather than hide them, since they would be more informative than a simple "FAILED".

Coach257 commented 10 months ago

Thanks for your suggestion. However, the LAMM-Benchmark results are out of date; we recommend ChEF for the latest benchmark.