Closed Reason-Wang closed 10 months ago
Any progress on this average score?
@lintangsutawika is working on this!
Currently WIP in #922
The output would look something like this:
hf (pretrained=EleutherAI/pythia-2.8b), limit: None, num_fewshot: None, batch_size: 1
| Tasks |Version|Filter|Metric| Value | |Stderr|
|------------------------------------------|-------|------|------|---------:|---|-----:|
|mmlu |N/A |none |acc | 0.2522|± |0.0267|
| | | |size |14042.0000| | |
|-mmlu_humanities |N/A |none |acc | 0.2349|± |0.0205|
|--mmlu_high_school_us_history |Yaml |none |acc | 0.2353|± |0.0298|
|--mmlu_moral_scenarios |Yaml |none |acc | 0.2425|± |0.0143|
|--mmlu_moral_disputes |Yaml |none |acc | 0.2254|± |0.0225|
|--mmlu_prehistory |Yaml |none |acc | 0.2716|± |0.0247|
|--mmlu_world_religions |Yaml |none |acc | 0.2398|± |0.0327|
|--mmlu_professional_law |Yaml |none |acc | 0.2288|± |0.0107|
|--mmlu_international_law |Yaml |none |acc | 0.1983|± |0.0364|
|--mmlu_logical_fallacies |Yaml |none |acc | 0.1963|± |0.0312|
|--mmlu_high_school_world_history |Yaml |none |acc | 0.2616|± |0.0286|
|--mmlu_philosophy |Yaml |none |acc | 0.1961|± |0.0226|
|--mmlu_high_school_european_history |Yaml |none |acc | 0.2485|± |0.0337|
|--mmlu_formal_logic |Yaml |none |acc | 0.2857|± |0.0404|
|--mmlu_jurisprudence |Yaml |none |acc | 0.2407|± |0.0413|
|-mmlu_other |N/A |none |acc | 0.2803|± |0.0290|
|--mmlu_miscellaneous |Yaml |none |acc | 0.2771|± |0.0160|
|--mmlu_marketing |Yaml |none |acc | 0.2735|± |0.0292|
|--mmlu_nutrition |Yaml |none |acc | 0.2288|± |0.0241|
|--mmlu_human_aging |Yaml |none |acc | 0.2960|± |0.0306|
|--mmlu_global_facts |Yaml |none |acc | 0.3000|± |0.0461|
|--mmlu_management |Yaml |none |acc | 0.2524|± |0.0430|
|--mmlu_medical_genetics |Yaml |none |acc | 0.2700|± |0.0446|
|--mmlu_clinical_knowledge |Yaml |none |acc | 0.2566|± |0.0269|
|--mmlu_professional_medicine |Yaml |none |acc | 0.4118|± |0.0299|
|--mmlu_college_medicine |Yaml |none |acc | 0.2717|± |0.0339|
|--mmlu_virology |Yaml |none |acc | 0.2771|± |0.0348|
|--mmlu_professional_accounting |Yaml |none |acc | 0.2695|± |0.0265|
|--mmlu_business_ethics |Yaml |none |acc | 0.2200|± |0.0416|
|-mmlu_social_sciences |N/A |none |acc | 0.2398|± |0.0261|
|--mmlu_high_school_government_and_politics|Yaml |none |acc | 0.2487|± |0.0312|
|--mmlu_econometrics |Yaml |none |acc | 0.2193|± |0.0389|
|--mmlu_us_foreign_policy |Yaml |none |acc | 0.2400|± |0.0429|
|--mmlu_public_relations |Yaml |none |acc | 0.3636|± |0.0461|
|--mmlu_high_school_microeconomics |Yaml |none |acc | 0.2353|± |0.0276|
|--mmlu_professional_psychology |Yaml |none |acc | 0.2500|± |0.0175|
|--mmlu_security_studies |Yaml |none |acc | 0.1918|± |0.0252|
|--mmlu_human_sexuality |Yaml |none |acc | 0.2137|± |0.0360|
|--mmlu_high_school_geography |Yaml |none |acc | 0.1970|± |0.0283|
|--mmlu_sociology |Yaml |none |acc | 0.2289|± |0.0297|
|--mmlu_high_school_psychology |Yaml |none |acc | 0.2734|± |0.0191|
|--mmlu_high_school_macroeconomics |Yaml |none |acc | 0.2128|± |0.0208|
|-mmlu_stem |N/A |none |acc | 0.2623|± |0.0337|
|--mmlu_college_chemistry |Yaml |none |acc | 0.2300|± |0.0423|
|--mmlu_conceptual_physics |Yaml |none |acc | 0.2979|± |0.0299|
|--mmlu_college_mathematics |Yaml |none |acc | 0.2700|± |0.0446|
|--mmlu_computer_security |Yaml |none |acc | 0.2900|± |0.0456|
|--mmlu_high_school_chemistry |Yaml |none |acc | 0.2512|± |0.0305|
|--mmlu_high_school_physics |Yaml |none |acc | 0.2517|± |0.0354|
|--mmlu_astronomy |Yaml |none |acc | 0.2434|± |0.0349|
|--mmlu_college_biology |Yaml |none |acc | 0.2847|± |0.0377|
|--mmlu_high_school_biology |Yaml |none |acc | 0.2581|± |0.0249|
|--mmlu_high_school_statistics |Yaml |none |acc | 0.2269|± |0.0286|
|--mmlu_elementary_mathematics |Yaml |none |acc | 0.2619|± |0.0226|
|--mmlu_college_physics |Yaml |none |acc | 0.3431|± |0.0472|
|--mmlu_electrical_engineering |Yaml |none |acc | 0.2069|± |0.0338|
|--mmlu_high_school_mathematics |Yaml |none |acc | 0.2630|± |0.0268|
|--mmlu_machine_learning |Yaml |none |acc | 0.2589|± |0.0416|
|--mmlu_abstract_algebra |Yaml |none |acc | 0.3000|± |0.0461|
|--mmlu_high_school_computer_science |Yaml |none |acc | 0.2400|± |0.0429|
|--mmlu_college_computer_science |Yaml |none |acc | 0.2800|± |0.0451|
|--mmlu_anatomy |Yaml |none |acc | 0.2667|± |0.0382|
|Groups|Version|Filter|Metric| Value | |Stderr|
|------|-------|------|------|---------:|---|-----:|
|mmlu |N/A |none |acc | 0.2522|± |0.0267|
| | | |size |14042.0000| | |
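As a sketch of what a group row like `mmlu` above represents, the snippet below computes a size-weighted average accuracy over per-subtask results. This is purely illustrative and not lm-evaluation-harness's actual aggregation code; the subtask names and example counts here are hypothetical placeholders.

```python
# Illustrative sketch (NOT lm-eval's implementation): derive a group score
# like the "mmlu" row by weighting each subtask's accuracy by its number
# of examples, so larger subtasks contribute proportionally more.

# Hypothetical per-subtask results: {task_name: (accuracy, num_examples)}
subtask_results = {
    "mmlu_abstract_algebra": (0.3000, 100),
    "mmlu_anatomy": (0.2667, 135),
    "mmlu_astronomy": (0.2434, 152),
}

def weighted_average(results):
    """Return (size-weighted mean accuracy, total example count)."""
    total_correct = sum(acc * n for acc, n in results.values())
    total_examples = sum(n for _, n in results.values())
    return total_correct / total_examples, total_examples

avg_acc, size = weighted_average(subtask_results)
print(f"group acc: {avg_acc:.4f} over {size} examples")
```

With the full 57 MMLU subtasks, the same weighting over all 14042 examples yields the single `mmlu` accuracy shown in the Groups table.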
How do I use this PR to evaluate all MMLU tasks? Could anyone provide an example command line? I have updated to the latest version.
For now, you'll need to use the big-refactor branch, then use something like this:

```shell
lm-eval --model hf --model_args "pretrained=EleutherAI/pythia-2.8b" --tasks mmlu
```
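For completeness, here is a sketch of the full setup from a fresh clone. It assumes the branch is still named `big-refactor` and that the repository lives under EleutherAI's GitHub organization; check the repo for the current branch layout before running.

```shell
# Setup sketch, assuming the big-refactor branch name used in this thread.
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout big-refactor
pip install -e .

# Run every MMLU subtask and report the grouped/average scores.
# Note: the flag is --tasks (plural) in current releases; older snapshots
# of the branch may have used a different spelling.
lm-eval --model hf \
    --model_args "pretrained=EleutherAI/pythia-2.8b" \
    --tasks mmlu
```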
Currently I am using `hendrycksTest-*` to measure performance on MMLU. However, the scores are reported for each individual task; it would be much more convenient to also have scores for each subcategory plus an overall average. Also, MMLU is a better-known name than hendrycksTest. Would it be possible to add `mmlu` as an alternative name for running all of these tasks?