MMMU-Benchmark / MMMU

This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI"
https://mmmu-benchmark.github.io/
Apache License 2.0

How was "prompt engineering" performed? #12

Closed mckinziebrandon closed 10 months ago

mckinziebrandon commented 10 months ago

Hi, great work! I'm not seeing any examples of how you convert the input documents to actual prompts for the model. In the paper, the only relevant snippet seems to be:

If models do not provide prompts for task types in MMMU, we conduct prompt engineering on the validation set and use the most effective prompt for the zero-shot setup in the main experiments.

Can you please provide examples of how you formatted the inputs to any of the models you evaluated this on? Thanks!

Note: I see that in #5 you clarify that you follow MMLU, but this seems to contradict the statement in the paper about prompt engineering on the validation set. Can you clarify that in particular?

drogozhang commented 10 months ago

Thanks for the question.

As mentioned in the paper, for each model we use the prompts reported in its original paper/repo, if available. Otherwise, we explore prompts on the validation set and use the best one for the test set evaluation.

We recently uploaded the inference code for LLaVA (including the prompt construction). You can check the Prompt Template or our reply in issue #9 for a prompt example.
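
For readers who want a concrete illustration before looking at the LLaVA inference code, below is a minimal sketch of MMLU-style multiple-choice prompt construction for an MMMU sample. The field names (`question`, `options`), the option lettering, and the final instruction line are assumptions for illustration; the authoritative template is the one in the uploaded inference code.

```python
# Minimal sketch of MMLU-style multiple-choice prompt construction for an MMMU
# sample. The field names, option labels, and instruction line are illustrative
# assumptions, not the exact prompt used in the paper.

def build_multiple_choice_prompt(sample: dict) -> str:
    """Format one MMMU sample as a zero-shot multiple-choice prompt."""
    letters = "ABCDEFGHIJ"
    option_lines = [
        f"({letters[i]}) {opt}" for i, opt in enumerate(sample["options"])
    ]
    return (
        f"{sample['question']}\n"  # question text may contain <image 1> placeholders
        + "\n".join(option_lines)
        + "\nAnswer with the option's letter from the given choices directly."
    )


# Hypothetical sample for demonstration only.
example = {
    "question": "What does <image 1> most likely depict?",
    "options": ["A cell membrane", "A mitochondrion", "A ribosome", "A nucleus"],
}
print(build_multiple_choice_prompt(example))
```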

mckinziebrandon commented 10 months ago

Great, thanks for the quick response!