MMMU-Benchmark / MMMU

This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI"
https://mmmu-benchmark.github.io/
Apache License 2.0

Reproducing LLaVa-1.5-13b #10

Closed: teasgen closed this issue 9 months ago

teasgen commented 9 months ago

Hi, I'm trying to reproduce the LLaVa-1.5-13b results on your benchmark using prompts like:

"Question: <image 1> Baxter Company has a relevant range of production between 15,000 and 30,000 units. The following cost data represents average variable costs per unit for 25,000 units of production. If 30,000 units are produced, what are the per unit manufacturing overhead costs incurred?\n Option: (A) $6\n(B) $7\n(C) $8\n(D) $9\nAnswer:"

for multiple-choice questions (e.g. validation_Accounting_1)

Or for open questions:

Question: Using a finite summation, compute the initial deflection at midspan for the beam in Figure P8.42. Given: E = 3000 kips/in.2. Use 3-ft segments. Assume I = 0.5IG. <image 1>\nAnswer:

(e.g. validation_Architecture_and_Engineering_14). But I'm getting only 32.5% on the validation split vs. the reported 36.4% (I tried only the questions with a single input image, 856/900). What could be the problem?

Originally posted by @teasgen in https://github.com/MMMU-Benchmark/MMMU/issues/5#issuecomment-1845122257
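
For reference, here is a minimal sketch of how such a multiple-choice prompt could be assembled from an MMMU validation record. The config name ("Accounting") and field names (`question`, `options`) are assumptions based on the HuggingFace dataset card, not the official evaluation scripts:

```python
import ast
from datasets import load_dataset

# Assumption: the MMMU HF dataset exposes per-subject configs and stores the
# answer choices as a stringified Python list in the "options" field.
ds = load_dataset("MMMU/MMMU", "Accounting", split="validation")

def build_mc_prompt(example: dict) -> str:
    """Assemble a 'Question: ... Option: ... Answer:' prompt like the one quoted above."""
    options = ast.literal_eval(example["options"])  # e.g. "['$6', '$7', '$8', '$9']"
    choice_txt = "\n".join(f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options))
    return f"Question: {example['question']}\nOption: {choice_txt}\nAnswer:"

print(build_mc_prompt(ds[0]))  # first validation example of the Accounting subject
```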

Rubics-Xuan commented 9 months ago

Same for me. Using the prompt "{question}\n{choice_txt}\nAnswer with the option's letter from the given choices directly." for LLaVA-v1.5-13b, I only get approximately 23% on the whole validation split. I hope you can help me figure out what the problem could be.
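
A hypothetical expansion of that template, to make the comparison with the prompt above concrete. The (A)/(B)/... formatting of `choice_txt` is an assumption, since the exact string is not shown in this thread:

```python
def fill_template(question: str, options: list[str]) -> str:
    # Hypothetical rendering of "{question}\n{choice_txt}\nAnswer with the option's
    # letter from the given choices directly."
    choice_txt = "\n".join(f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options))
    return (f"{question}\n{choice_txt}\n"
            "Answer with the option's letter from the given choices directly.")

print(fill_template("<image 1> What are the per unit manufacturing overhead costs incurred?",
                    ["$6", "$7", "$8", "$9"]))
```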

xiangyue9607 commented 9 months ago

@teasgen Thanks! See #9

drogozhang commented 9 months ago

Hi, thanks for your interest!

Please refer to #9 for the detailed prompt for Llava-1.5.

teasgen commented 9 months ago

> Hi, thanks for your interest!
>
> Please refer to #9 for the detailed prompt for Llava-1.5.

Thanks!

drogozhang commented 9 months ago

Hi @teasgen, quick update:

We uploaded the inference code and the sample results it produces, achieving 35.8 on the val set.

Let me know if you have any other questions.

teasgen commented 9 months ago

> Hi @teasgen, quick update:
>
> We uploaded the inference code and the sample results it produces, achieving 35.8 on the val set.
>
> Let me know if you have any other questions.

Thank you again! But why does this code achieve only 35.8, when the paper reports about 36.4? Does that mean the results have high variance? Do you have some intuition about how much the score can change between runs?

drogozhang commented 9 months ago

@teasgen Like I mentioned in the other repo, LLaVA is not deterministic, so its outputs can differ slightly with different random seeds. Here I didn't try to find a specific random seed that matches the performance of the previous code.

Also, for deterministic models like BLIP-2, the results would be the same.
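
For readers trying to pin down the run-to-run variance: with HuggingFace transformers, fixing the seed and using greedy decoding usually makes repeated runs reproducible. The snippet below is a generic sketch under those assumptions; the `llava-hf/llava-1.5-13b-hf` checkpoint and the prompt handling are assumptions, not the repo's inference code:

```python
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration, set_seed

# Generic reproducibility sketch, not the repo's inference script. The checkpoint
# name is an assumption; substitute whichever LLaVA-1.5-13b weights you use.
set_seed(42)  # fixes Python, NumPy, and torch RNGs

model_id = "llava-hf/llava-1.5-13b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def answer(prompt: str, image) -> str:
    # `prompt` must already contain the model's image placeholder, e.g.
    # "USER: <image>\n{question} ASSISTANT:" for llava-hf checkpoints.
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    # do_sample=False -> greedy decoding, so generation itself is deterministic;
    # any residual drift between runs would come from non-deterministic CUDA kernels.
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    return processor.decode(out[0], skip_special_tokens=True)
```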