jybbjybb opened 1 month ago
I added one argument, --apply_chat_template, and the accuracy increased to 54.78%, but that is still short of Meta's claim on the Hugging Face repo (60.3%). The command I ran this time is
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m lm_eval --model hf-multimodal --model_args pretrained=/mnt/LLM_checkpoints/Llama-3.2-90B-Vision-Instruct/,parallelize=True --tasks mmmu_val --batch_size 1 --apply_chat_template
|             Groups              |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|---------------------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmmu_val                         |      0|none  |      |acc   |↑  |0.5478|±  |0.0158|
| - Art and Design                |      0|none  |      |acc   |↑  |0.6833|±  |0.0358|
| - Business                      |      0|none  |      |acc   |↑  |0.5267|±  |0.0411|
| - Health and Medicine           |      0|none  |      |acc   |↑  |0.5933|±  |0.0403|
| - Humanities and Social Science |      0|none  |      |acc   |↑  |0.7417|±  |0.0400|
| - Science                       |      0|none  |      |acc   |↑  |0.4667|±  |0.0409|
| - Tech and Engineering          |      0|none  |      |acc   |↑  |0.4000|±  |0.0331|
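In case it helps interpret the flag: with hf-multimodal, --apply_chat_template renders each question through the model's own chat template instead of scoring the raw MMMU text. Here is a minimal sketch of what that rendered prompt looks like using the standard transformers processor API; the message layout below is my illustration, not the harness's exact internals, and the question text is made up.

```python
# Sketch: what a question looks like once the Llama 3.2 Vision chat template is applied.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-90B-Vision-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Which label in the figure ...? (A) ... (B) ... (C) ... (D) ..."},
        ],
    }
]

# add_generation_prompt=True appends the assistant header so the model answers
# the question instead of continuing the user turn.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
print(prompt)  # the template-wrapped string the model actually sees
```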
For the 11B Vision base model, I got numbers that were pretty far off too:
Command: lm_eval --model hf-multimodal --model_args pretrained=meta-llama/Llama-3.2-11B-Vision,dtype=bfloat16,max_images=2,parallelize=True --tasks mmmu_val --batch_size 32
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|---------------------------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmmu_val | 0|none | |acc |↑ |0.2667|± |0.0147|
| - Art and Design | 0|none | |acc |↑ |0.3250|± |0.0427|
| - Art | 0|none | 0|acc |↑ |0.2667|± |0.0821|
| - Art Theory | 0|none | 0|acc |↑ |0.2333|± |0.0785|
| - Design | 0|none | 0|acc |↑ |0.3333|± |0.0875|
| - Music | 0|none | 0|acc |↑ |0.4667|± |0.0926|
| - Business | 0|none | |acc |↑ |0.2733|± |0.0369|
| - Accounting | 0|none | 0|acc |↑ |0.3333|± |0.0875|
| - Economics | 0|none | 0|acc |↑ |0.3000|± |0.0851|
| - Finance | 0|none | 0|acc |↑ |0.2333|± |0.0785|
| - Manage | 0|none | 0|acc |↑ |0.2333|± |0.0785|
| - Marketing | 0|none | 0|acc |↑ |0.2667|± |0.0821|
| - Health and Medicine | 0|none | |acc |↑ |0.3200|± |0.0385|
| - Basic Medical Science | 0|none | 0|acc |↑ |0.3333|± |0.0875|
| - Clinical Medicine | 0|none | 0|acc |↑ |0.2333|± |0.0785|
| - Diagnostics and Laboratory Medicine| 0|none | 0|acc |↑ |0.3667|± |0.0895|
| - Pharmacy | 0|none | 0|acc |↑ |0.3667|± |0.0895|
| - Public Health | 0|none | 0|acc |↑ |0.3000|± |0.0851|
| - Humanities and Social Science | 0|none | |acc |↑ |0.2917|± |0.0407|
| - History | 0|none | 0|acc |↑ |0.1333|± |0.0631|
| - Literature | 0|none | 0|acc |↑ |0.2667|± |0.0821|
| - Psychology | 0|none | 0|acc |↑ |0.3000|± |0.0851|
| - Sociology | 0|none | 0|acc |↑ |0.4667|± |0.0926|
| - Science | 0|none | |acc |↑ |0.1867|± |0.0319|
| - Biology | 0|none | 0|acc |↑ |0.3000|± |0.0851|
| - Chemistry | 0|none | 0|acc |↑ |0.1333|± |0.0631|
| - Geography | 0|none | 0|acc |↑ |0.1667|± |0.0692|
| - Math | 0|none | 0|acc |↑ |0.1333|± |0.0631|
| - Physics | 0|none | 0|acc |↑ |0.2000|± |0.0743|
| - Tech and Engineering | 0|none | |acc |↑ |0.2333|± |0.0295|
| - Agriculture | 0|none | 0|acc |↑ |0.1667|± |0.0692|
| - Architecture and Engineering | 0|none | 0|acc |↑ |0.2333|± |0.0785|
| - Computer Science | 0|none | 0|acc |↑ |0.3000|± |0.0851|
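One thing worth checking for the base-model run above is the max_images=2 cap, since some MMMU questions carry more than two images and could be affected, depending on how hf-multimodal applies the cap. Below is a quick sanity check against the source dataset; it is a sketch, and the MMMU/MMMU repo id, per-subject configs, and image_1..image_7 field names are my assumptions about the HF dataset layout.

```python
# Count how many MMMU validation questions have more than two images.
from datasets import get_dataset_config_names, load_dataset

total = 0
multi = 0
for config in get_dataset_config_names("MMMU/MMMU"):
    ds = load_dataset("MMMU/MMMU", config, split="validation")
    for row in ds:
        n_images = sum(row[f"image_{i}"] is not None for i in range(1, 8))
        total += 1
        multi += n_images > 2
print(f"{multi}/{total} validation questions have more than 2 images")
```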
Meta must be using some additional prompting to reach the reported accuracy. Adding --apply_chat_template is part of it, but not enough on its own.
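If you want to see exactly what the harness sends (and compare it against whatever prompt Meta used), one option is to drive the same run from the Python API and keep the per-question logs. A minimal sketch, assuming a recent lm-eval-harness 0.4.x where simple_evaluate exposes apply_chat_template and log_samples; check your installed version's signature, and note the field names in the sample records vary by version.

```python
# Run the evaluation via the Python API so per-question records (including the
# rendered prompt string for each question) stay inspectable afterwards.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf-multimodal",
    model_args="pretrained=meta-llama/Llama-3.2-11B-Vision-Instruct,"
               "dtype=bfloat16,max_images=5,parallelize=True",
    tasks=["mmmu_val"],
    batch_size=1,
    apply_chat_template=True,  # same effect as the --apply_chat_template CLI flag
    log_samples=True,          # keep per-question prompts and model responses
)

print(results["results"])      # aggregate accuracy per task/group
# results["samples"] holds the per-question records for prompt inspection.
```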
For the above, I was using the base model. Using apply chat template, I got 0.4722 for Llama 11B Instruct with max_images=5 (for GPU memory reasons). Let me know if you get something similar!
0.4722 is a reasonable number; you can check https://github.com/jybbjybb/llama_quant/blob/main/LLaMA3.2.md for detailed results.
I have tested LLaMA3.2_vision_90B_instruct on the mmmu_val task; the result is as follows. The accuracy is 0.43, but Meta's Hugging Face model card says it is 60.3. I think Meta's results use CoT, while this mmmu_val setup may not. Can CoT alone explain the roughly 17-point (60.3 vs. 43) difference?
It takes 11 hours on 4x A100 GPUs to finish these 900 questions. Is that a reasonable time?
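For a rough sense of scale, that works out to about 44 seconds per question:

```python
# Back-of-the-envelope throughput for the run above: 11 hours over ~900 questions.
seconds_per_question = 11 * 3600 / 900
print(f"{seconds_per_question:.0f} s per question")  # ~44 s
```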
The command I use to run the test is