MMStar-Benchmark / MMStar

This repo contains evaluation code for the paper "Are We on the Right Way for Evaluating Large Vision-Language Models?"
https://mmstar-benchmark.github.io

How are the values of MG and ML calculated? #4

Closed: mary-0830 closed this issue 5 months ago

mary-0830 commented 5 months ago

Hi authors, taking the Yi-34B model as an example, my understanding is that Swv = 21.9 and Sv = 36.1, so MG = Sv - Swv = 14.2. Is that right?

How is the value of St obtained?

Could you give an example? I obtained the "mm" and "to" scores with VLMEvalKit, but I don't quite understand how to calculate MG and ML from them.

Thanks!!!

xiaoachen98 commented 5 months ago

> Hi authors, taking the Yi-34B model as an example, my understanding is that Swv = 21.9 and Sv = 36.1, so MG = Sv - Swv = 14.2. Is that right? How is the value of St obtained? Could you give an example?

We have updated the evaluation guidelines and showcase the calculation of MG and ML for LLaVA-Next-34B there. By the way, we would be glad if you could submit your own results to our leaderboard. Enjoy!
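As a quick sanity check, here is a minimal Python sketch of the MG part, using only the Yi-34B numbers you quoted (Sv from --gen-mode mm, Swv from --gen-mode to):

```python
# Multi-modal gain (MG): how much the visual inputs actually contribute.
# Sv  : LVLM score on MMStar with images    (--gen-mode mm)
# Swv : LVLM score on MMStar without images (--gen-mode to)
Sv, Swv = 36.1, 21.9  # the Yi-34B numbers quoted above

MG = Sv - Swv
print(f"MG = {MG:.1f}")  # -> MG = 14.2
```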

mary-0830 commented 5 months ago

Hi authors, thanks for your reply! I understand the calculation method, but I still have a question: why switch the model in the third step? Is its base model the same?

If the multi-modal model does not look at the image (i.e., --gen-mode to), doesn't that already give an LLM score?

Thanks again for your answer.

xiaoachen98 commented 5 months ago

> Why switch the model in the third step? Is its base model the same? If the multi-modal model does not look at the image (i.e., --gen-mode to), doesn't that already give an LLM score?

In the third step we replace the LVLM with its corresponding base language model. Although the LVLM in the "to" mode operates solely through its underlying LLM, it is important to note that most LVLMs unfreeze the parameters of their LLMs during multi-modal training. We therefore compute the performance difference on the same benchmark between the original LLM and the LLM after multi-modal training. This reflects, to a certain extent, data leakage during the multi-modal training process.
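Sketched in Python under this reading (the St value below is only a placeholder, not a measured result): ML compares the LVLM's image-free run with a separate text-only run of its original base LLM.

```python
# Multi-modal leakage (ML): the part of the image-free score Swv that the
# original base LLM cannot account for on its own.
# Swv : LVLM score without images (--gen-mode to); its LLM parameters were
#       updated during multi-modal training
# St  : score of the original, untouched base LLM on the same questions
Swv = 21.9  # value quoted earlier in this thread
St = 20.0   # placeholder; substitute the base LLM's actual text-only score

ML = max(0.0, Swv - St)  # clipped at zero: only a surplus counts as leakage
print(f"ML = {ML:.1f}")
```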