EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Is LLaMA3.2-Vision-90B/11B result on mmmu_val reproducible? #2377

Open jybbjybb opened 1 month ago

jybbjybb commented 1 month ago

I have tested Llama-3.2-90B-Vision-Instruct on the task mmmu_val; the results are below. The accuracy is 0.43, but Meta's Hugging Face model card reports 60.3. I think Meta's result uses CoT, while this mmmu_val task may not. Can CoT alone explain the 17-point (60.3 vs. 43.0) gap?

It takes 11 hours on 4xA100 GPUs to finish these 900 questions, which is roughly 44 seconds per question at batch_size 1. Is that a reasonable time?

The command I used to run the test is:

CUDA_VISIBLE_DEVICES=4,5,6,7 python -m lm_eval --model hf-multimodal --model_args pretrained=/mnt/LLM_checkpoints/Llama-3.2-90B-Vision-Instruct/,parallelize=True --tasks mmmu_val  --batch_size 1
|             Groups              |Version|Filter|n-shot|Metric|Value |   |Stderr|
|---------------------------------|------:|------|-----:|------|-----:|---|-----:|
|mmmu_val                         |      0|none  |      |acc   |0.4300|±  |0.0163|
| - Art and Design                |      0|none  |      |acc   |0.5333|±  |0.0454|
| - Business                      |      0|none  |      |acc   |0.3467|±  |0.0393|
| - Health and Medicine           |      0|none  |      |acc   |0.5067|±  |0.0412|
| - Humanities and Social Science |      0|none  |      |acc   |0.5750|±  |0.0451|
| - Science                       |      0|none  |      |acc   |0.3467|±  |0.0392|
| - Tech and Engineering          |      0|none  |      |acc   |0.3524|±  |0.0329|
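
For reference, the same run can also be driven from Python through the harness's simple_evaluate API. A minimal sketch, assuming lm-eval 0.4.x; the checkpoint path is the local one from the command above, and the exact keyword names should be checked against the installed version:

```python
# Minimal sketch of the equivalent run via the Python API (lm-eval 0.4.x).
# The checkpoint path is local; adjust to your setup.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"  # set before the model loads

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf-multimodal",
    model_args=(
        "pretrained=/mnt/LLM_checkpoints/Llama-3.2-90B-Vision-Instruct/,"
        "parallelize=True"
    ),
    tasks=["mmmu_val"],
    batch_size=1,
)

# Per-group accuracies live under results["results"].
print(results["results"]["mmmu_val"])
```
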
jybbjybb commented 1 month ago

I added one argument, --apply_chat_template, and the accuracy increased to 54.78%. That is still short of Meta's claim on the Hugging Face repo (60.3%). The command for this run is:

CUDA_VISIBLE_DEVICES=4,5,6,7 python -m lm_eval --model hf-multimodal --model_args pretrained=/mnt/LLM_checkpoints/Llama-3.2-90B-Vision-Instruct/,parallelize=True --tasks mmmu_val  --batch_size 1 --apply_chat_template
|             Groups              |Version|Filter|n-shot|Metric|Value |   |Stderr|
|---------------------------------|------:|------|-----:|------|-----:|---|-----:|
|mmmu_val                         |      0|none  |      |acc   |0.5478|±  |0.0158|
| - Art and Design                |      0|none  |      |acc   |0.6833|±  |0.0358|
| - Business                      |      0|none  |      |acc   |0.5267|±  |0.0411|
| - Health and Medicine           |      0|none  |      |acc   |0.5933|±  |0.0403|
| - Humanities and Social Science |      0|none  |      |acc   |0.7417|±  |0.0400|
| - Science                       |      0|none  |      |acc   |0.4667|±  |0.0409|
| - Tech and Engineering          |      0|none  |      |acc   |0.4000|±  |0.0331|
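
The jump from 0.43 to 0.55 from --apply_chat_template alone suggests the instruct model is very sensitive to prompt formatting. One way to inspect exactly what the flag changes is to render the chat template directly with transformers; a small sketch, where the hosted model ID and the message content are illustrative:

```python
# Sketch: render the chat template that --apply_chat_template wraps
# around each question, to inspect the exact prompt the model sees.
from transformers import AutoProcessor

# Illustrative model ID; any Llama 3.2 vision instruct checkpoint works.
processor = AutoProcessor.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Which option is correct? (A) ... (B) ..."},
        ],
    }
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
print(prompt)  # shows the <|image|> placement and the header tokens
```
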
BabyChouSr commented 1 month ago

For the 11B Vision base model, I got numbers that were pretty far off too. Command:

lm_eval --model hf-multimodal --model_args pretrained=meta-llama/Llama-3.2-11B-Vision,dtype=bfloat16,max_images=2,parallelize=True --tasks mmmu_val --batch_size 32

|                 Tasks                 |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|---------------------------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmmu_val                               |      0|none  |      |acc   |↑  |0.2667|±  |0.0147|
| - Art and Design                      |      0|none  |      |acc   |↑  |0.3250|±  |0.0427|
|  - Art                                |      0|none  |     0|acc   |↑  |0.2667|±  |0.0821|
|  - Art Theory                         |      0|none  |     0|acc   |↑  |0.2333|±  |0.0785|
|  - Design                             |      0|none  |     0|acc   |↑  |0.3333|±  |0.0875|
|  - Music                              |      0|none  |     0|acc   |↑  |0.4667|±  |0.0926|
| - Business                            |      0|none  |      |acc   |↑  |0.2733|±  |0.0369|
|  - Accounting                         |      0|none  |     0|acc   |↑  |0.3333|±  |0.0875|
|  - Economics                          |      0|none  |     0|acc   |↑  |0.3000|±  |0.0851|
|  - Finance                            |      0|none  |     0|acc   |↑  |0.2333|±  |0.0785|
|  - Manage                             |      0|none  |     0|acc   |↑  |0.2333|±  |0.0785|
|  - Marketing                          |      0|none  |     0|acc   |↑  |0.2667|±  |0.0821|
| - Health and Medicine                 |      0|none  |      |acc   |↑  |0.3200|±  |0.0385|
|  - Basic Medical Science              |      0|none  |     0|acc   |↑  |0.3333|±  |0.0875|
|  - Clinical Medicine                  |      0|none  |     0|acc   |↑  |0.2333|±  |0.0785|
|  - Diagnostics and Laboratory Medicine|      0|none  |     0|acc   |↑  |0.3667|±  |0.0895|
|  - Pharmacy                           |      0|none  |     0|acc   |↑  |0.3667|±  |0.0895|
|  - Public Health                      |      0|none  |     0|acc   |↑  |0.3000|±  |0.0851|
| - Humanities and Social Science       |      0|none  |      |acc   |↑  |0.2917|±  |0.0407|
|  - History                            |      0|none  |     0|acc   |↑  |0.1333|±  |0.0631|
|  - Literature                         |      0|none  |     0|acc   |↑  |0.2667|±  |0.0821|
|  - Psychology                         |      0|none  |     0|acc   |↑  |0.3000|±  |0.0851|
|  - Sociology                          |      0|none  |     0|acc   |↑  |0.4667|±  |0.0926|
| - Science                             |      0|none  |      |acc   |↑  |0.1867|±  |0.0319|
|  - Biology                            |      0|none  |     0|acc   |↑  |0.3000|±  |0.0851|
|  - Chemistry                          |      0|none  |     0|acc   |↑  |0.1333|±  |0.0631|
|  - Geography                          |      0|none  |     0|acc   |↑  |0.1667|±  |0.0692|
|  - Math                               |      0|none  |     0|acc   |↑  |0.1333|±  |0.0631|
|  - Physics                            |      0|none  |     0|acc   |↑  |0.2000|±  |0.0743|
| - Tech and Engineering                |      0|none  |      |acc   |↑  |0.2333|±  |0.0295|
|  - Agriculture                        |      0|none  |     0|acc   |↑  |0.1667|±  |0.0692|
|  - Architecture and Engineering       |      0|none  |     0|acc   |↑  |0.2333|±  |0.0785|
|  - Computer Science                   |      0|none  |     0|acc   |↑  |0.3000|±  |0.0851|
jybbjybb commented 1 month ago

Meta must be using some extra prompting tricks to reach their reported accuracy. Adding --apply_chat_template recovers part of the gap, but it is not enough on its own.
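
If Meta's 60.3 depends on CoT, one could try approximating it with a system instruction on top of the chat template. A hypothetical sketch via the Python API; the instruction wording is invented, and since the harness scores MMMU by matching the option letter, free-form CoT output may not be extracted correctly without a custom filter:

```python
# Hypothetical sketch: CoT-style system instruction on top of the chat
# template. The instruction text is invented; whether the harness's answer
# extraction handles CoT output needs checking for your lm-eval version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf-multimodal",
    model_args=(
        "pretrained=meta-llama/Llama-3.2-90B-Vision-Instruct,parallelize=True"
    ),
    tasks=["mmmu_val"],
    batch_size=1,
    apply_chat_template=True,
    system_instruction=(
        "Think step by step, then answer with only the letter "
        "of the correct option."
    ),
)
```
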

BabyChouSr commented 1 month ago

For the run above I was using the base model. With the instruct model, I applied the chat template and got 0.4722 for Llama 11B-Instruct with max_images=5 (for GPU memory reasons). Let me know if you get something similar!

jybbjybb commented 1 month ago

> For the run above I was using the base model. With the instruct model, I applied the chat template and got 0.4722 for Llama 11B-Instruct with max_images=5 (for GPU memory reasons). Let me know if you get something similar!

0.4722 is a reasonable number; see https://github.com/jybbjybb/llama_quant/blob/main/LLaMA3.2.md for detailed results.