MMStar-Benchmark / MMStar

This repo contains evaluation code for the paper "Are We on the Right Way for Evaluating Large Vision-Language Models?"
https://mmstar-benchmark.github.io

The calculation of ML metrics is not quite appropriate. #6

Closed: echo840 closed this issue 5 months ago

echo840 commented 5 months ago

Thank you very much for your great work, it has been very enlightening. However, I have some questions regarding the calculation of the ML (multi-modal leakage) metric.

Firstly, I believe it would be more appropriate to use the LLMs' 2-shot performance, as reported in Table 2, as the baseline in Tables 3, 5, and 6. In the 0-shot setting, some LLMs do not adhere to the expected answer format or simply refuse to answer, which results in extremely low LLM scores. This in turn makes LVLMs built on those LLMs appear to suffer from more severe data leakage.

For example, in the 0-shot setting, Qwen-7B has an average score of 25.3 in Table 1, the lowest among all listed LLMs and roughly 10 points below LLMs of a similar size. It even scores below the smaller Phi2-2.7B and Qwen1.5-1.8B. Clearly, this does not reflect the actual capability of Qwen-7B. In the 2-shot setting, however, Qwen-7B reaches 35.4, which is in line with other models of similar size.

If the 0-shot setting is used as the leakage baseline, then Qwen-7B's 2-shot result alone would already translate into an ML score of 10.1, while other LLMs normally stay below 5. That would indicate significant data leakage, even though no leakage can occur in this process: the LLM has merely learned to format its answers correctly instead of refusing to answer.
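To make the concern concrete, here is a minimal Python sketch using only the Qwen-7B scores quoted above (25.3 in 0-shot, 35.4 in 2-shot). The `ml_score` helper assumes ML is the non-negative gap between a text-only score and the chosen LLM baseline; this is my reading of the metric, not the repo's exact implementation.

```python
# Minimal sketch of how a low 0-shot LLM baseline inflates the apparent ML.
# Scores are the ones quoted above from Tables 1 and 2; the ml_score formula
# (non-negative gap to the LLM baseline) is my reading of the metric, not the
# repo's exact code.

def ml_score(text_only_score: float, llm_baseline: float) -> float:
    """ML as the non-negative gap between a text-only score and the LLM baseline."""
    return max(0.0, text_only_score - llm_baseline)

qwen7b_0shot = 25.3  # Table 1: lowest among the listed LLMs
qwen7b_2shot = 35.4  # Table 2: in line with similarly sized LLMs

# Treating the 2-shot LLM itself as the "model under test" against the 0-shot
# baseline already yields an ML of 10.1, although no multimodal training (and
# hence no possible leakage) happens between the two settings.
print(f"{ml_score(qwen7b_2shot, qwen7b_0shot):.1f}")  # 10.1
```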

[image: table comparing LLM 0-shot and 2-shot scores and the resulting ML values]

In the table above, Qwen-7B scores 10.1 points higher in the 2-shot setting than in the 0-shot setting, the largest gap among the listed LLMs. With the 0-shot baseline, both Monkey-Chat and Sphinx-X-MoE have an ML score of 14.2, which would indicate the highest likelihood of data leakage during multimodal training. With the 2-shot baseline, however, Monkey-Chat's ML drops to 4.1 and Sphinx-X-MoE's to 5, suggesting a low degree of leakage.

Therefore, we have reason to believe that Monkey-Chat and Sphinx-X-MoE obtain the highest ML scores under the 0-shot baseline mainly because, after multimodal training, they follow instructions better and produce answers in the required format, rather than because of data leakage.
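The same sketch shows the mechanism behind the Monkey-Chat numbers above (assuming Qwen-7B is Monkey-Chat's base LLM). The LVLM-text score of 39.5 is not reported anywhere; I back-derived it from the quoted ML of 14.2 and Qwen-7B's 0-shot score, purely for illustration.

```python
# Hypothetical continuation of the sketch above: the same LVLM-text score
# yields very different ML values depending on the LLM baseline. The 39.5 is
# back-derived from the quoted ML of 14.2 under the 0-shot baseline, not an
# officially reported number.

def ml_score(text_only_score: float, llm_baseline: float) -> float:
    return max(0.0, text_only_score - llm_baseline)

monkey_chat_lvlm_text = 39.5             # hypothetical, see note above
qwen7b_0shot, qwen7b_2shot = 25.3, 35.4  # base LLM scores quoted in this thread

print(f"ML vs 0-shot baseline: {ml_score(monkey_chat_lvlm_text, qwen7b_0shot):.1f}")  # 14.2
print(f"ML vs 2-shot baseline: {ml_score(monkey_chat_lvlm_text, qwen7b_2shot):.1f}")  # 4.1
```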

Moreover, I think it is also essential to account for the possibility of refusals in the LVLM-text setting.

Overall, I think it is more appropriate to use the LLMs' 2-shot performance from Table 2 as the baseline in Tables 3, 5, and 6, because in the 0-shot setting some LLMs do not follow the answer format or refuse to answer, resulting in extremely low LLM scores. This makes the subsequent ML calculation unfair.

Thank you again for your great work!

xiaoachen98 commented 5 months ago

Thanks for your valuable suggestion. We will take it into account and update our paper in the next version.