MMMU-Benchmark / MMMU

This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI"
https://mmmu-benchmark.github.io/
Apache License 2.0

Model Evaluation #9

Closed · Rubics-Xuan closed this issue 11 months ago

Rubics-Xuan commented 11 months ago

Thanks for your great work! Do you have any plans to release the evaluation code for LLaVA-v1.5? Looking forward to your reply.

zhangmozhe commented 11 months ago

Same question here. The official LLaVA v1.5 checkpoint only supports a single image for inference, so how do you evaluate the model on multi-image examples in your case? Thank you.

xiangyue9607 commented 11 months ago

We followed the demo code of LLaVA v1.5 in their official repo with the released prompts. For multi-image examples, we only input the first image into the model. We applied the same rule to other baselines that only support a single image. We will update the paper and GitHub README. Hope this clarifies.
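For reference, here is a minimal sketch of this first-image rule, assuming an MMMU-style example dict with PIL images stored under keys `image_1` through `image_7` (these field names are my assumption, not taken from the released code):

```python
from typing import Optional

from PIL import Image


def pick_first_image(example: dict, max_images: int = 7) -> Optional[Image.Image]:
    """Return only the first available image of a (possibly multi-image) example.

    Models such as LLaVA-1.5 that accept a single image per query are fed
    just this image; any remaining images are dropped.
    """
    for i in range(1, max_images + 1):  # assumed keys: image_1 ... image_7
        img = example.get(f"image_{i}")
        if img is not None:
            return img
    return None
```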

drogozhang commented 11 months ago

Hi, thanks for your interest!

For multiple-choice questions and open questions, LLaVA uses different end prompts.

Here is the prompt in the official repo:

1. Short answer (e.g. VQAv2, MME):
   <question>
   Answer the question using a single word or phrase.

2. Option-only for multiple-choice (e.g. MMBench, SEED-Bench):
   <question>
   A. <option_1>
   B. <option_2>
   C. <option_3>
   D. <option_4>
   Answer with the option's letter from the given choices directly.

Here is an example of the input we fed into LLaVA-1.5.

For an open question:

The amplifier circuit in <image 1> uses an npn silicon transistor with the following maximum ratings: $P_{C,max}$ = 2.5 W (after derating) $BV_{CEO}$ = 80 V $V_{CE,sat}$ = 2 V.  Calculate the maximum power dissipated by the load, $R_L$.
Answer the question using a single word or phrase.

For multiple-choice:

Calculate the admittance Y(s) = [{I_1(s)} / {V_1(s)}] for the network shown in <image 1>.
(A) {(3s) / (11s + 8)}
(B) {(6s) / (10s + 8)}
(C) {(4s) / (11s + 8)}
(D) {(6s) / (11s + 8)}

Answer with the option's letter from the given choices directly.

Note that for multiple-choice options, we also use \n to separate them, and we represent the options with parentheses, e.g., (A), (B), (C).
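Putting this together, here is a minimal sketch of how such a prompt could be assembled; the helper name and exact formatting are illustrative assumptions, not the released evaluation code:

```python
import string
from typing import List, Optional


def build_prompt(question: str, options: Optional[List[str]] = None) -> str:
    """Build a LLaVA-style prompt from an MMMU question.

    Multiple-choice options are labeled (A), (B), ... and joined with
    newlines, followed by the option-letter instruction; open questions
    instead get the single-word/phrase instruction.
    """
    if options:  # multiple-choice question
        option_lines = "\n".join(
            f"({string.ascii_uppercase[i]}) {opt}" for i, opt in enumerate(options)
        )
        return (
            f"{question}\n{option_lines}\n\n"
            "Answer with the option's letter from the given choices directly."
        )
    # open (short-answer) question
    return f"{question}\nAnswer the question using a single word or phrase."
```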

Let me know if there are any other questions.

Rubics-Xuan commented 11 months ago

Many thanks for your prompt reply! Following your helpful suggestions, I can currently reproduce LLaVA-v1.5's performance on the MMMU benchmark at 33.0. I am wondering what other factors I can check to reach the reported 36.4.

drogozhang commented 11 months ago

Are you using Vicuna-13B? Could you share your code with us so we can double-check and provide more clues? I am currently at NeurIPS and may not be able to reply quickly over the next few days.

Rubics-Xuan commented 11 months ago

Thanks for your reply. I use the LLaVA-v1.5 weights from https://huggingface.co/liuhaotian/llava-v1.5-13b, and the model's performance is still stuck around 33.0 on the validation set.

drogozhang commented 11 months ago

Thanks! I wonder if you have tried other random seeds, as LLaVA's outputs can differ across random seeds.
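If it helps, here is a minimal sketch of pinning the seeds before running inference, using standard PyTorch/NumPy calls rather than the project's exact setup:

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Fix the random seeds so LLaVA's sampled outputs are reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```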

Rubics-Xuan commented 11 months ago

Besides, there is another question I hope you can help me with. I am wondering which version of GPT-4V was used in the main paper. I gave the failure cases presented in your paper's appendix to GPT-4V on the website, and it turned out that GPT-4V could solve nearly half of the 15 questions I tried. Do you have any clues about what may lead to this result?

xiangyue9607 commented 10 months ago

> Besides, there is another question I hope you can help me with. I am wondering which version of GPT-4V was used in the main paper. I gave the failure cases presented in your paper's appendix to GPT-4V on the website, and it turned out that GPT-4V could solve nearly half of the 15 questions I tried. Do you have any clues about what may lead to this result?

Thanks for your question! We tested all the questions before the November major update of the ChatGPT playground. It is very likely that OpenAI improved its models during that update, though we cannot be sure. The current playground also calls the code interpreter and other tools by default to solve problems, which could be another factor leading to fewer error cases.

drogozhang commented 10 months ago

@Rubics-Xuan We uploaded the inference code; please have a look. If the results don't match, try different seeds.

We also uploaded the results produced by this code, which achieves 35.8 on the validation set.

Let me know if you have any other questions.

Rubics-Xuan commented 10 months ago

Many thanks for all the helpful replies!