EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Add mmlu average score in report #881

Closed · Reason-Wang closed this issue 10 months ago

Reason-Wang commented 11 months ago

Currently I am using hendrycksTest-* to measure performance on MMLU. However, scores are reported only for each individual task. It would be much more convenient if there were scores for each subcategory plus an overall average. Also, MMLU is probably a better-known name than hendrycksTest. Would it be possible to add mmlu as an alias for running all of the tasks and reporting aggregated results?
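
For illustration, this is the kind of aggregation I mean. It is a minimal sketch that assumes per-task results are available as (accuracy, number of examples) pairs; the task names and values are made up and are not the harness's actual internals:

```python
# Minimal sketch: aggregate per-task MMLU accuracies into one average.
# Task names and values are illustrative, not taken from a real run.
per_task = {
    "hendrycksTest-abstract_algebra": (0.30, 100),  # (accuracy, n_examples)
    "hendrycksTest-anatomy": (0.27, 135),
    "hendrycksTest-astronomy": (0.24, 152),
}

# Unweighted (macro) average over tasks.
macro_avg = sum(acc for acc, _ in per_task.values()) / len(per_task)

# Size-weighted average, so larger tasks count proportionally more.
total_n = sum(n for _, n in per_task.values())
weighted_avg = sum(acc * n for acc, n in per_task.values()) / total_n

print(f"macro: {macro_avg:.4f}, weighted: {weighted_avg:.4f}, size: {total_n}")
```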

sanyalsunny111 commented 11 months ago

Any progress on this average score?

StellaAthena commented 11 months ago

@lintangsutawika is working on this!

lintangsutawika commented 11 months ago

Currently WIP in #922

The output would look something like this:

hf (pretrained=EleutherAI/pythia-2.8b), limit: None, num_fewshot: None, batch_size: 1
|                  Tasks                   |Version|Filter|Metric|  Value   |   |Stderr|
|------------------------------------------|-------|------|------|---------:|---|-----:|
|mmlu                                      |N/A    |none  |acc   |    0.2522|±  |0.0267|
|                                          |       |      |size  |14042.0000|   |      |
|-mmlu_humanities                          |N/A    |none  |acc   |    0.2349|±  |0.0205|
|--mmlu_high_school_us_history             |Yaml   |none  |acc   |    0.2353|±  |0.0298|
|--mmlu_moral_scenarios                    |Yaml   |none  |acc   |    0.2425|±  |0.0143|
|--mmlu_moral_disputes                     |Yaml   |none  |acc   |    0.2254|±  |0.0225|
|--mmlu_prehistory                         |Yaml   |none  |acc   |    0.2716|±  |0.0247|
|--mmlu_world_religions                    |Yaml   |none  |acc   |    0.2398|±  |0.0327|
|--mmlu_professional_law                   |Yaml   |none  |acc   |    0.2288|±  |0.0107|
|--mmlu_international_law                  |Yaml   |none  |acc   |    0.1983|±  |0.0364|
|--mmlu_logical_fallacies                  |Yaml   |none  |acc   |    0.1963|±  |0.0312|
|--mmlu_high_school_world_history          |Yaml   |none  |acc   |    0.2616|±  |0.0286|
|--mmlu_philosophy                         |Yaml   |none  |acc   |    0.1961|±  |0.0226|
|--mmlu_high_school_european_history       |Yaml   |none  |acc   |    0.2485|±  |0.0337|
|--mmlu_formal_logic                       |Yaml   |none  |acc   |    0.2857|±  |0.0404|
|--mmlu_jurisprudence                      |Yaml   |none  |acc   |    0.2407|±  |0.0413|
|-mmlu_other                               |N/A    |none  |acc   |    0.2803|±  |0.0290|
|--mmlu_miscellaneous                      |Yaml   |none  |acc   |    0.2771|±  |0.0160|
|--mmlu_marketing                          |Yaml   |none  |acc   |    0.2735|±  |0.0292|
|--mmlu_nutrition                          |Yaml   |none  |acc   |    0.2288|±  |0.0241|
|--mmlu_human_aging                        |Yaml   |none  |acc   |    0.2960|±  |0.0306|
|--mmlu_global_facts                       |Yaml   |none  |acc   |    0.3000|±  |0.0461|
|--mmlu_management                         |Yaml   |none  |acc   |    0.2524|±  |0.0430|
|--mmlu_medical_genetics                   |Yaml   |none  |acc   |    0.2700|±  |0.0446|
|--mmlu_clinical_knowledge                 |Yaml   |none  |acc   |    0.2566|±  |0.0269|
|--mmlu_professional_medicine              |Yaml   |none  |acc   |    0.4118|±  |0.0299|
|--mmlu_college_medicine                   |Yaml   |none  |acc   |    0.2717|±  |0.0339|
|--mmlu_virology                           |Yaml   |none  |acc   |    0.2771|±  |0.0348|
|--mmlu_professional_accounting            |Yaml   |none  |acc   |    0.2695|±  |0.0265|
|--mmlu_business_ethics                    |Yaml   |none  |acc   |    0.2200|±  |0.0416|
|-mmlu_social_sciences                     |N/A    |none  |acc   |    0.2398|±  |0.0261|
|--mmlu_high_school_government_and_politics|Yaml   |none  |acc   |    0.2487|±  |0.0312|
|--mmlu_econometrics                       |Yaml   |none  |acc   |    0.2193|±  |0.0389|
|--mmlu_us_foreign_policy                  |Yaml   |none  |acc   |    0.2400|±  |0.0429|
|--mmlu_public_relations                   |Yaml   |none  |acc   |    0.3636|±  |0.0461|
|--mmlu_high_school_microeconomics         |Yaml   |none  |acc   |    0.2353|±  |0.0276|
|--mmlu_professional_psychology            |Yaml   |none  |acc   |    0.2500|±  |0.0175|
|--mmlu_security_studies                   |Yaml   |none  |acc   |    0.1918|±  |0.0252|
|--mmlu_human_sexuality                    |Yaml   |none  |acc   |    0.2137|±  |0.0360|
|--mmlu_high_school_geography              |Yaml   |none  |acc   |    0.1970|±  |0.0283|
|--mmlu_sociology                          |Yaml   |none  |acc   |    0.2289|±  |0.0297|
|--mmlu_high_school_psychology             |Yaml   |none  |acc   |    0.2734|±  |0.0191|
|--mmlu_high_school_macroeconomics         |Yaml   |none  |acc   |    0.2128|±  |0.0208|
|-mmlu_stem                                |N/A    |none  |acc   |    0.2623|±  |0.0337|
|--mmlu_college_chemistry                  |Yaml   |none  |acc   |    0.2300|±  |0.0423|
|--mmlu_conceptual_physics                 |Yaml   |none  |acc   |    0.2979|±  |0.0299|
|--mmlu_college_mathematics                |Yaml   |none  |acc   |    0.2700|±  |0.0446|
|--mmlu_computer_security                  |Yaml   |none  |acc   |    0.2900|±  |0.0456|
|--mmlu_high_school_chemistry              |Yaml   |none  |acc   |    0.2512|±  |0.0305|
|--mmlu_high_school_physics                |Yaml   |none  |acc   |    0.2517|±  |0.0354|
|--mmlu_astronomy                          |Yaml   |none  |acc   |    0.2434|±  |0.0349|
|--mmlu_college_biology                    |Yaml   |none  |acc   |    0.2847|±  |0.0377|
|--mmlu_high_school_biology                |Yaml   |none  |acc   |    0.2581|±  |0.0249|
|--mmlu_high_school_statistics             |Yaml   |none  |acc   |    0.2269|±  |0.0286|
|--mmlu_elementary_mathematics             |Yaml   |none  |acc   |    0.2619|±  |0.0226|
|--mmlu_college_physics                    |Yaml   |none  |acc   |    0.3431|±  |0.0472|
|--mmlu_electrical_engineering             |Yaml   |none  |acc   |    0.2069|±  |0.0338|
|--mmlu_high_school_mathematics            |Yaml   |none  |acc   |    0.2630|±  |0.0268|
|--mmlu_machine_learning                   |Yaml   |none  |acc   |    0.2589|±  |0.0416|
|--mmlu_abstract_algebra                   |Yaml   |none  |acc   |    0.3000|±  |0.0461|
|--mmlu_high_school_computer_science       |Yaml   |none  |acc   |    0.2400|±  |0.0429|
|--mmlu_college_computer_science           |Yaml   |none  |acc   |    0.2800|±  |0.0451|
|--mmlu_anatomy                            |Yaml   |none  |acc   |    0.2667|±  |0.0382|

|Groups|Version|Filter|Metric|  Value   |   |Stderr|
|------|-------|------|------|---------:|---|-----:|
|mmlu  |N/A    |none  |acc   |    0.2522|±  |0.0267|
|      |       |      |size  |14042.0000|   |      |
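
The group rows (mmlu, mmlu_stem, and so on) are aggregates of their subtask rows. One plausible way to reproduce a group-level value and stderr from the subtasks is a size-weighted mean with standard variance propagation. The sketch below uses hypothetical numbers and is not necessarily the exact aggregation the harness implements:

```python
# Illustrative check: combine subtask (acc, stderr, n) rows into a group row.
# Numbers are hypothetical; the harness's exact aggregation may differ.
subtasks = [
    # (accuracy, stderr, n_examples)
    (0.2300, 0.0423, 100),
    (0.2979, 0.0299, 235),
    (0.2700, 0.0446, 100),
]

total_n = sum(n for _, _, n in subtasks)
group_acc = sum(acc * n for acc, _, n in subtasks) / total_n

# Variance of a weighted mean of independent estimates: sum of (w_i * se_i)^2.
group_var = sum(((n / total_n) * se) ** 2 for _, se, n in subtasks)
group_stderr = group_var ** 0.5

print(f"acc={group_acc:.4f} ± {group_stderr:.4f} over {total_n} examples")
```
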
fancyerii commented 10 months ago

How do I use this PR to evaluate all MMLU tasks? Could anyone provide an example command line? I have updated to the latest version.

lintangsutawika commented 10 months ago

For now, you'll need to use the big-refactor branch. Then use something like this: `lm-eval --model hf --model_args "pretrained=EleutherAI/pythia-2.8b" --task mmlu`
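
If you prefer calling it from Python, the rough equivalent should look like the sketch below. This assumes the big-refactor branch exposes `simple_evaluate` with the same model / model_args / tasks arguments as the CLI, so treat it as a sketch rather than a guaranteed interface:

```python
# Sketch of a Python-API call mirroring the CLI command above.
# Assumes simple_evaluate on the big-refactor branch accepts these arguments;
# check the version you have installed if this raises an error.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-2.8b",
    tasks=["mmlu"],
    batch_size=1,
)

# Per-task and group metrics are expected under results["results"].
print(results["results"].get("mmlu"))
```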