Seems about right. Have you tried with --batch_size 1?
After setting --batch_size 1, I get 60.48, no big difference.
Hmm, you could also check on the OpenLLM [branch](https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463) as well, just to be sure. The command has changed a bit, but it's in the README, and MMLU can be called with --tasks hendrycksTest*
@clefourrier how have you all been aggregating the MMLU subtasks' scores for the Open LLM Leaderboard?
We switched to weighting MMLU scores by the number of docs per subtask, as in https://github.com/hendrycks/test/blob/4450500f923c49f1fb1dd3d99108a0bd9717b660/evaluate.py#L82-L99 -- (we hadn't previously been reporting an averaged MMLU score in v0.3.0!)
"(average of all the results acc)" from here.
Looks like they do an unweighted average. Might be worth it to report both weighted and unweighted for MMLU (at least for the group as a whole)
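For illustration, here is a minimal Python sketch of the two aggregation schemes being discussed; the subtask accuracies and document counts below are placeholder values, not the real MMLU numbers:

```python
# Sketch: unweighted vs. doc-count-weighted aggregation of MMLU subtask accuracies.
# All numbers below are placeholders for illustration only.

subtask_acc = {
    "hendrycksTest-abstract_algebra": 0.36,
    "hendrycksTest-anatomy": 0.5185,
    "hendrycksTest-astronomy": 0.6842,
}
subtask_docs = {  # number of test docs per subtask (placeholder values)
    "hendrycksTest-abstract_algebra": 100,
    "hendrycksTest-anatomy": 135,
    "hendrycksTest-astronomy": 152,
}

# Unweighted: every subtask counts equally, regardless of its size.
unweighted = sum(subtask_acc.values()) / len(subtask_acc)

# Weighted: each subtask contributes in proportion to its doc count,
# which is equivalent to pooling all questions and scoring them together.
total_docs = sum(subtask_docs.values())
weighted = sum(acc * subtask_docs[t] for t, acc in subtask_acc.items()) / total_docs

print(f"unweighted: {unweighted:.4f}, weighted: {weighted:.4f}")
```

Small subtasks pull the unweighted average around more than the weighted one, which is why the two numbers can differ noticeably.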
I tried and got:

| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| hendrycksTest-abstract_algebra | 1 | acc | 0.3600 | ± | 0.0482 |
| | | acc_norm | 0.3600 | ± | 0.0482 |
| hendrycksTest-anatomy | 1 | acc | 0.5185 | ± | 0.0432 |
| | | acc_norm | 0.5185 | ± | 0.0432 |
| hendrycksTest-astronomy | 1 | acc | 0.6842 | ± | 0.0378 |
| | | acc_norm | 0.6842 | ± | 0.0378 |
| hendrycksTest-business_ethics | 1 | acc | 0.6600 | ± | 0.0476 |
| | | acc_norm | 0.6600 | ± | 0.0476 |
| hendrycksTest-clinical_knowledge | 1 | acc | 0.6792 | ± | 0.0287 |
| | | acc_norm | 0.6792 | ± | 0.0287 |
| hendrycksTest-college_biology | 1 | acc | 0.6736 | ± | 0.0392 |
| | | acc_norm | 0.6736 | ± | 0.0392 |
| hendrycksTest-college_chemistry | 1 | acc | 0.4400 | ± | 0.0499 |
| | | acc_norm | 0.4400 | ± | 0.0499 |
| hendrycksTest-college_computer_science | 1 | acc | 0.5800 | ± | 0.0496 |
| | | acc_norm | 0.5800 | ± | 0.0496 |
| hendrycksTest-college_mathematics | 1 | acc | 0.3500 | ± | 0.0479 |
| | | acc_norm | 0.3500 | ± | 0.0479 |
| hendrycksTest-college_medicine | 1 | acc | 0.6243 | ± | 0.0369 |
| | | acc_norm | 0.6243 | ± | 0.0369 |
| hendrycksTest-college_physics | 1 | acc | 0.4020 | ± | 0.0488 |
| | | acc_norm | 0.4020 | ± | 0.0488 |
| hendrycksTest-computer_security | 1 | acc | 0.7200 | ± | 0.0451 |
| | | acc_norm | 0.7200 | ± | 0.0451 |
| hendrycksTest-conceptual_physics | 1 | acc | 0.5574 | ± | 0.0325 |
| | | acc_norm | 0.5574 | ± | 0.0325 |
| hendrycksTest-econometrics | 1 | acc | 0.4561 | ± | 0.0469 |
| | | acc_norm | 0.4561 | ± | 0.0469 |
| hendrycksTest-electrical_engineering | 1 | acc | 0.5586 | ± | 0.0414 |
| | | acc_norm | 0.5586 | ± | 0.0414 |
| hendrycksTest-elementary_mathematics | 1 | acc | 0.4683 | ± | 0.0257 |
| | | acc_norm | 0.4683 | ± | 0.0257 |
| hendrycksTest-formal_logic | 1 | acc | 0.4048 | ± | 0.0439 |
| | | acc_norm | 0.4048 | ± | 0.0439 |
| hendrycksTest-global_facts | 1 | acc | 0.3700 | ± | 0.0485 |
| | | acc_norm | 0.3700 | ± | 0.0485 |
| hendrycksTest-high_school_biology | 1 | acc | 0.7161 | ± | 0.0256 |
| | | acc_norm | 0.7161 | ± | 0.0256 |
| hendrycksTest-high_school_chemistry | 1 | acc | 0.5320 | ± | 0.0351 |
| | | acc_norm | 0.5320 | ± | 0.0351 |
| hendrycksTest-high_school_computer_science | 1 | acc | 0.7200 | ± | 0.0451 |
| | | acc_norm | 0.7200 | ± | 0.0451 |
| hendrycksTest-high_school_european_history | 1 | acc | 0.7091 | ± | 0.0355 |
| | | acc_norm | 0.7091 | ± | 0.0355 |
| hendrycksTest-high_school_geography | 1 | acc | 0.7828 | ± | 0.0294 |
| | | acc_norm | 0.7828 | ± | 0.0294 |
| hendrycksTest-high_school_government_and_politics | 1 | acc | 0.8083 | ± | 0.0284 |
| | | acc_norm | 0.8083 | ± | 0.0284 |
| hendrycksTest-high_school_macroeconomics | 1 | acc | 0.5821 | ± | 0.0250 |
| | | acc_norm | 0.5821 | ± | 0.0250 |
| hendrycksTest-high_school_mathematics | 1 | acc | 0.3333 | ± | 0.0287 |
| | | acc_norm | 0.3333 | ± | 0.0287 |
| hendrycksTest-high_school_microeconomics | 1 | acc | 0.6387 | ± | 0.0312 |
| | | acc_norm | 0.6387 | ± | 0.0312 |
| hendrycksTest-high_school_physics | 1 | acc | 0.3510 | ± | 0.0390 |
| | | acc_norm | 0.3510 | ± | 0.0390 |
| hendrycksTest-high_school_psychology | 1 | acc | 0.8092 | ± | 0.0168 |
| | | acc_norm | 0.8092 | ± | 0.0168 |
| hendrycksTest-high_school_statistics | 1 | acc | 0.5324 | ± | 0.0340 |
| | | acc_norm | 0.5324 | ± | 0.0340 |
| hendrycksTest-high_school_us_history | 1 | acc | 0.7794 | ± | 0.0291 |
| | | acc_norm | 0.7794 | ± | 0.0291 |
| hendrycksTest-high_school_world_history | 1 | acc | 0.7890 | ± | 0.0266 |
| | | acc_norm | 0.7890 | ± | 0.0266 |
| hendrycksTest-human_aging | 1 | acc | 0.6188 | ± | 0.0326 |
| | | acc_norm | 0.6188 | ± | 0.0326 |
| hendrycksTest-human_sexuality | 1 | acc | 0.7176 | ± | 0.0395 |
| | | acc_norm | 0.7176 | ± | 0.0395 |
| hendrycksTest-international_law | 1 | acc | 0.8099 | ± | 0.0358 |
| | | acc_norm | 0.8099 | ± | 0.0358 |
| hendrycksTest-jurisprudence | 1 | acc | 0.7778 | ± | 0.0402 |
| | | acc_norm | 0.7778 | ± | 0.0402 |
| hendrycksTest-logical_fallacies | 1 | acc | 0.6933 | ± | 0.0362 |
| | | acc_norm | 0.6933 | ± | 0.0362 |
| hendrycksTest-machine_learning | 1 | acc | 0.4286 | ± | 0.0470 |
| | | acc_norm | 0.4286 | ± | 0.0470 |
| hendrycksTest-management | 1 | acc | 0.7573 | ± | 0.0425 |
| | | acc_norm | 0.7573 | ± | 0.0425 |
| hendrycksTest-marketing | 1 | acc | 0.8675 | ± | 0.0222 |
| | | acc_norm | 0.8675 | ± | 0.0222 |
| hendrycksTest-medical_genetics | 1 | acc | 0.6900 | ± | 0.0465 |
| | | acc_norm | 0.6900 | ± | 0.0465 |
| hendrycksTest-miscellaneous | 1 | acc | 0.7765 | ± | 0.0149 |
| | | acc_norm | 0.7765 | ± | 0.0149 |
| hendrycksTest-moral_disputes | 1 | acc | 0.6416 | ± | 0.0258 |
| | | acc_norm | 0.6416 | ± | 0.0258 |
| hendrycksTest-moral_scenarios | 1 | acc | 0.3151 | ± | 0.0155 |
| | | acc_norm | 0.3151 | ± | 0.0155 |
| hendrycksTest-nutrition | 1 | acc | 0.6569 | ± | 0.0272 |
| | | acc_norm | 0.6569 | ± | 0.0272 |
| hendrycksTest-philosophy | 1 | acc | 0.6688 | ± | 0.0267 |
| | | acc_norm | 0.6688 | ± | 0.0267 |
| hendrycksTest-prehistory | 1 | acc | 0.6759 | ± | 0.0260 |
| | | acc_norm | 0.6759 | ± | 0.0260 |
| hendrycksTest-professional_accounting | 1 | acc | 0.4433 | ± | 0.0296 |
| | | acc_norm | 0.4433 | ± | 0.0296 |
| hendrycksTest-professional_law | 1 | acc | 0.4381 | ± | 0.0127 |
| | | acc_norm | 0.4381 | ± | 0.0127 |
| hendrycksTest-professional_medicine | 1 | acc | 0.5919 | ± | 0.0299 |
| | | acc_norm | 0.5919 | ± | 0.0299 |
| hendrycksTest-professional_psychology | 1 | acc | 0.5817 | ± | 0.0200 |
| | | acc_norm | 0.5817 | ± | 0.0200 |
| hendrycksTest-public_relations | 1 | acc | 0.6091 | ± | 0.0467 |
| | | acc_norm | 0.6091 | ± | 0.0467 |
| hendrycksTest-security_studies | 1 | acc | 0.7265 | ± | 0.0285 |
| | | acc_norm | 0.7265 | ± | 0.0285 |
| hendrycksTest-sociology | 1 | acc | 0.8259 | ± | 0.0268 |
| | | acc_norm | 0.8259 | ± | 0.0268 |
| hendrycksTest-us_foreign_policy | 1 | acc | 0.8400 | ± | 0.0368 |
| | | acc_norm | 0.8400 | ± | 0.0368 |
| hendrycksTest-virology | 1 | acc | 0.4940 | ± | 0.0389 |
| | | acc_norm | 0.4940 | ± | 0.0389 |
| hendrycksTest-world_religions | 1 | acc | 0.7661 | ± | 0.0325 |
| | | acc_norm | 0.7661 | ± | 0.0325 |
It is hard to tell which score is the MMLU score.
@shellhue you need to average all of these scores to get the score we have on the Open LLM Leaderboard.
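As a rough sketch of how that averaging could be done from the harness's JSON output (the results file name and the exact key layout below are assumptions, so adjust them to your actual run):

```python
import json

# Assumed path and layout of the lm-eval results file; adjust to your run.
with open("results.json") as f:
    results = json.load(f)["results"]

# Unweighted mean of the "acc" metric over every hendrycksTest-* subtask.
accs = [
    metrics["acc"]
    for task, metrics in results.items()
    if task.startswith("hendrycksTest-")
]
print(f"MMLU (mean acc over {len(accs)} subtasks): {sum(accs) / len(accs):.4f}")
```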
@haileyschoelkopf Since we pinned a specific version of the harness, we've tried not to edit any of the mechanisms, to preserve reproducibility.
After averaging all of these scores, I get 61.4080701, which is very close to the 61.97 reported on the leaderboard.
Then I think this solves your issue :)
I cloned the lm-evaluation-harness repo from main and followed the instructions to install it. Then I evaluated the model Qwen/Qwen1.5-7B on MMLU with the command below. The output MMLU score is 60.43, but the reported MMLU score on the Open LLM Leaderboard is 61.97. Is this difference normal, or am I evaluating in a wrong way?
repo commit: b177c82c
The command I used to evaluate: