EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Weird evaluation result of MMLU #887

Closed Yuxin715d closed 11 months ago

Yuxin715d commented 1 year ago

Hi, team. I tried to use your implementation to compute MMLU scores for some models, but for some of them the results are weird. For example, for llama2-13b, the commands I use to test its few-shot and zero-shot scores are as follows:

```bash
model_path=llama2-hf/13B
output_path=${output_dir}/${model_path//\//_}.txt
python3 lm-evaluation-harness/main.py \
    --model hf-causal-experimental \
    --model_args pretrained=${model_path},dtype=float16 \
    --tasks ${tasks} \
    --num_fewshot 5 \
    --batch_size 4 \
    --device cuda:${gpu_index} \
    --no_cache \
    --output_path ${output_path}
```

```bash
python3 lm-evaluation-harness/main.py \
    --model hf-causal-experimental \
    --model_args pretrained=${model_path},dtype=float16 \
    --tasks ${tasks} \
    --batch_size 4 \
    --device cuda:${gpu_index} \
    --no_cache \
    --output_path ${output_path}
```

But I found that for some tasks in MMLU, such as hendrycksTest-astronomy and hendrycksTest-college_chemistry, the few-shot result is worse than the zero-shot result.

5-shot:

```json
"hendrycksTest-astronomy": {
    "acc": 0.5328947368421053,
    "acc_stderr": 0.040601270352363966,
    "acc_norm": 0.5328947368421053,
    "acc_norm_stderr": 0.040601270352363966
}
```

0-shot:

```json
"hendrycksTest-astronomy": {
    "acc": 0.5723684210526315,
    "acc_stderr": 0.04026097083296564,
    "acc_norm": 0.5723684210526315,
    "acc_norm_stderr": 0.04026097083296564
}
```

This phenomenon also shows up in other models, such as bloomz-1b7 (there it is even more obvious; zero-shot is much better). I am new to this area and not very experienced, but this doesn't seem normal to me. Can you give some comments or advice on this? Thanks.
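For reference, the gap between the two astronomy scores above is small relative to the reported standard errors. A minimal significance sketch in plain Python, using only the `acc` and `acc_stderr` values quoted above and treating the two runs as independent (an approximation, since they share the same questions):

```python
import math

# Accuracies and standard errors taken from the JSON snippets above.
acc_5shot, se_5shot = 0.5328947368421053, 0.040601270352363966
acc_0shot, se_0shot = 0.5723684210526315, 0.04026097083296564

# Two-sample z-statistic for the difference between the two accuracies.
diff = acc_0shot - acc_5shot
se_diff = math.sqrt(se_5shot**2 + se_0shot**2)
z = diff / se_diff

# Two-sided p-value from the normal approximation.
p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

print(f"diff={diff:.4f}, z={z:.2f}, p~{p:.2f}")  # z ~ 0.69, p ~ 0.49
```

A roughly 4-point swing on a ~150-question subtask is comfortably within sampling noise, so individual MMLU subtask scores can easily flip direction between 0-shot and 5-shot runs.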

StellaAthena commented 1 year ago

This happens sometimes. We've extensively tested MMLU and I'm pretty sure we have a near-exact replication of the LLaMA 2 paper's results.

Leaving this open until someone double checks that we replicated LLaMA 2's MMLU scores.

fancyerii commented 10 months ago

I have tested Llama 2 13B and 70B on MMLU with the v0.4 harness. My 5-shot result for 70B is 0.632, which is not as good as the paper's result (0.68).

13B-chat 0-shot

hf (pretrained=/nas/lili/models_hf/13B-chat), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|                 Tasks                 |Version|Filter|n-shot|Metric|Value |   |Stderr|
|---------------------------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu                                   |N/A    |none  |     0|acc   |0.5315|±  |0.1228|
| - humanities                          |N/A    |none  |     0|acc   |0.4978|±  |0.1175|
|  - formal_logic                       |Yaml   |none  |     0|acc   |0.2381|±  |0.0381|
|  - high_school_european_history       |Yaml   |none  |     0|acc   |0.6667|±  |0.0368|
|  - high_school_us_history             |Yaml   |none  |     0|acc   |0.7304|±  |0.0311|
|  - high_school_world_history          |Yaml   |none  |     0|acc   |0.7215|±  |0.0292|
|  - international_law                  |Yaml   |none  |     0|acc   |0.7190|±  |0.0410|
|  - jurisprudence                      |Yaml   |none  |     0|acc   |0.6944|±  |0.0445|
|  - logical_fallacies                  |Yaml   |none  |     0|acc   |0.6871|±  |0.0364|
|  - moral_disputes                     |Yaml   |none  |     0|acc   |0.6012|±  |0.0264|
|  - moral_scenarios                    |Yaml   |none  |     0|acc   |0.2816|±  |0.0150|
|  - philosophy                         |Yaml   |none  |     0|acc   |0.6431|±  |0.0272|
|  - prehistory                         |Yaml   |none  |     0|acc   |0.6235|±  |0.0270|
|  - professional_law                   |Yaml   |none  |     0|acc   |0.4003|±  |0.0125|
|  - world_religions                    |Yaml   |none  |     0|acc   |0.7719|±  |0.0322|
| - other                               |N/A    |none  |     0|acc   |0.6064|±  |0.1190|
|  - business_ethics                    |Yaml   |none  |     0|acc   |0.5400|±  |0.0501|
|  - clinical_knowledge                 |Yaml   |none  |     0|acc   |0.5887|±  |0.0303|
|  - college_medicine                   |Yaml   |none  |     0|acc   |0.4162|±  |0.0376|
|  - global_facts                       |Yaml   |none  |     0|acc   |0.3100|±  |0.0465|
|  - human_aging                        |Yaml   |none  |     0|acc   |0.6278|±  |0.0324|
|  - management                         |Yaml   |none  |     0|acc   |0.6893|±  |0.0458|
|  - marketing                          |Yaml   |none  |     0|acc   |0.8034|±  |0.0260|
|  - medical_genetics                   |Yaml   |none  |     0|acc   |0.5800|±  |0.0496|
|  - miscellaneous                      |Yaml   |none  |     0|acc   |0.7676|±  |0.0151|
|  - nutrition                          |Yaml   |none  |     0|acc   |0.6078|±  |0.0280|
|  - professional_accounting            |Yaml   |none  |     0|acc   |0.4078|±  |0.0293|
|  - professional_medicine              |Yaml   |none  |     0|acc   |0.4963|±  |0.0304|
|  - virology                           |Yaml   |none  |     0|acc   |0.4639|±  |0.0388|
| - social_sciences                     |N/A    |none  |     0|acc   |0.6129|±  |0.0850|
|  - econometrics                       |Yaml   |none  |     0|acc   |0.2544|±  |0.0410|
|  - high_school_geography              |Yaml   |none  |     0|acc   |0.6515|±  |0.0339|
|  - high_school_government_and_politics|Yaml   |none  |     0|acc   |0.7617|±  |0.0307|
|  - high_school_macroeconomics         |Yaml   |none  |     0|acc   |0.5000|±  |0.0254|
|  - high_school_microeconomics         |Yaml   |none  |     0|acc   |0.5042|±  |0.0325|
|  - high_school_psychology             |Yaml   |none  |     0|acc   |0.7138|±  |0.0194|
|  - human_sexuality                    |Yaml   |none  |     0|acc   |0.6412|±  |0.0421|
|  - professional_psychology            |Yaml   |none  |     0|acc   |0.5425|±  |0.0202|
|  - public_relations                   |Yaml   |none  |     0|acc   |0.6273|±  |0.0463|
|  - security_studies                   |Yaml   |none  |     0|acc   |0.6612|±  |0.0303|
|  - sociology                          |Yaml   |none  |     0|acc   |0.7413|±  |0.0310|
|  - us_foreign_policy                  |Yaml   |none  |     0|acc   |0.8100|±  |0.0394|
| - stem                                |N/A    |none  |     0|acc   |0.4285|±  |0.1137|
|  - abstract_algebra                   |Yaml   |none  |     0|acc   |0.3100|±  |0.0465|
|  - anatomy                            |Yaml   |none  |     0|acc   |0.5185|±  |0.0432|
|  - astronomy                          |Yaml   |none  |     0|acc   |0.5789|±  |0.0402|
|  - college_biology                    |Yaml   |none  |     0|acc   |0.5764|±  |0.0413|
|  - college_chemistry                  |Yaml   |none  |     0|acc   |0.3400|±  |0.0476|
|  - college_computer_science           |Yaml   |none  |     0|acc   |0.4300|±  |0.0498|
|  - college_mathematics                |Yaml   |none  |     0|acc   |0.3000|±  |0.0461|
|  - college_physics                    |Yaml   |none  |     0|acc   |0.2647|±  |0.0439|
|  - computer_security                  |Yaml   |none  |     0|acc   |0.6700|±  |0.0473|
|  - conceptual_physics                 |Yaml   |none  |     0|acc   |0.4170|±  |0.0322|
|  - electrical_engineering             |Yaml   |none  |     0|acc   |0.5448|±  |0.0415|
|  - elementary_mathematics             |Yaml   |none  |     0|acc   |0.3254|±  |0.0241|
|  - high_school_biology                |Yaml   |none  |     0|acc   |0.6323|±  |0.0274|
|  - high_school_chemistry              |Yaml   |none  |     0|acc   |0.4483|±  |0.0350|
|  - high_school_computer_science       |Yaml   |none  |     0|acc   |0.5500|±  |0.0500|
|  - high_school_mathematics            |Yaml   |none  |     0|acc   |0.2852|±  |0.0275|
|  - high_school_physics                |Yaml   |none  |     0|acc   |0.3245|±  |0.0382|
|  - high_school_statistics             |Yaml   |none  |     0|acc   |0.3333|±  |0.0321|
|  - machine_learning                   |Yaml   |none  |     0|acc   |0.3393|±  |0.0449|

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.5315|±  |0.1228|
| - humanities     |N/A    |none  |     0|acc   |0.4978|±  |0.1175|
| - other          |N/A    |none  |     0|acc   |0.6064|±  |0.1190|
| - social_sciences|N/A    |none  |     0|acc   |0.6129|±  |0.0850|
| - stem           |N/A    |none  |     0|acc   |0.4285|±  |0.1137|
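
A note on reading these tables: each subtask row has a Stderr of roughly 0.02–0.05, yet the mmlu group row reports ±0.12. With about 14k test questions in total, pure sampling error on the aggregate would be far smaller, so the large group figure is dominated by how much accuracy varies across subtasks. The aggregate accuracy itself is essentially a size-weighted mean of the per-subtask accuracies; a minimal sketch of that aggregation (the example values are illustrative, and the harness's exact pooling formula may differ):

```python
# Size-weighted mean of per-subtask accuracies.
# The (n_questions, accuracy) pairs are illustrative examples, not the
# full 57-subtask MMLU breakdown; the harness's exact pooling may differ.
def weighted_acc(subtasks):
    total_n = sum(n for n, _ in subtasks)
    return sum(n * acc for n, acc in subtasks) / total_n

example = [(152, 0.5789), (100, 0.3400), (1534, 0.4003)]
print(f"{weighted_acc(example):.4f}")  # pulled toward the largest subtask
```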

70B-chat 0-shot

hf (pretrained=/nas/lili/models_hf/70B-chat-hf,parallelize=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|                 Tasks                 |Version|Filter|n-shot|Metric|Value |   |Stderr|
|---------------------------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu                                   |N/A    |none  |     0|acc   |0.6111|±  |0.1329|
| - humanities                          |N/A    |none  |     0|acc   |0.5609|±  |0.1369|
|  - formal_logic                       |Yaml   |none  |     0|acc   |0.3651|±  |0.0431|
|  - high_school_european_history       |Yaml   |none  |     0|acc   |0.8061|±  |0.0309|
|  - high_school_us_history             |Yaml   |none  |     0|acc   |0.8529|±  |0.0249|
|  - high_school_world_history          |Yaml   |none  |     0|acc   |0.8143|±  |0.0253|
|  - international_law                  |Yaml   |none  |     0|acc   |0.7603|±  |0.0390|
|  - jurisprudence                      |Yaml   |none  |     0|acc   |0.8148|±  |0.0376|
|  - logical_fallacies                  |Yaml   |none  |     0|acc   |0.7730|±  |0.0329|
|  - moral_disputes                     |Yaml   |none  |     0|acc   |0.7081|±  |0.0245|
|  - moral_scenarios                    |Yaml   |none  |     0|acc   |0.2469|±  |0.0144|
|  - philosophy                         |Yaml   |none  |     0|acc   |0.7106|±  |0.0258|
|  - prehistory                         |Yaml   |none  |     0|acc   |0.6944|±  |0.0256|
|  - professional_law                   |Yaml   |none  |     0|acc   |0.4778|±  |0.0128|
|  - world_religions                    |Yaml   |none  |     0|acc   |0.8304|±  |0.0288|
| - other                               |N/A    |none  |     0|acc   |0.6775|±  |0.1138|
|  - business_ethics                    |Yaml   |none  |     0|acc   |0.5700|±  |0.0498|
|  - clinical_knowledge                 |Yaml   |none  |     0|acc   |0.6491|±  |0.0294|
|  - college_medicine                   |Yaml   |none  |     0|acc   |0.6012|±  |0.0373|
|  - global_facts                       |Yaml   |none  |     0|acc   |0.3800|±  |0.0488|
|  - human_aging                        |Yaml   |none  |     0|acc   |0.6726|±  |0.0315|
|  - management                         |Yaml   |none  |     0|acc   |0.8252|±  |0.0376|
|  - marketing                          |Yaml   |none  |     0|acc   |0.8590|±  |0.0228|
|  - medical_genetics                   |Yaml   |none  |     0|acc   |0.6200|±  |0.0488|
|  - miscellaneous                      |Yaml   |none  |     0|acc   |0.8199|±  |0.0137|
|  - nutrition                          |Yaml   |none  |     0|acc   |0.6928|±  |0.0264|
|  - professional_accounting            |Yaml   |none  |     0|acc   |0.4787|±  |0.0298|
|  - professional_medicine              |Yaml   |none  |     0|acc   |0.5993|±  |0.0298|
|  - virology                           |Yaml   |none  |     0|acc   |0.5060|±  |0.0389|
| - social_sciences                     |N/A    |none  |     0|acc   |0.7267|±  |0.0780|
|  - econometrics                       |Yaml   |none  |     0|acc   |0.3772|±  |0.0456|
|  - high_school_geography              |Yaml   |none  |     0|acc   |0.7626|±  |0.0303|
|  - high_school_government_and_politics|Yaml   |none  |     0|acc   |0.8705|±  |0.0242|
|  - high_school_macroeconomics         |Yaml   |none  |     0|acc   |0.6359|±  |0.0244|
|  - high_school_microeconomics         |Yaml   |none  |     0|acc   |0.6513|±  |0.0310|
|  - high_school_psychology             |Yaml   |none  |     0|acc   |0.8349|±  |0.0159|
|  - human_sexuality                    |Yaml   |none  |     0|acc   |0.7557|±  |0.0377|
|  - professional_psychology            |Yaml   |none  |     0|acc   |0.6634|±  |0.0191|
|  - public_relations                   |Yaml   |none  |     0|acc   |0.7091|±  |0.0435|
|  - security_studies                   |Yaml   |none  |     0|acc   |0.6980|±  |0.0294|
|  - sociology                          |Yaml   |none  |     0|acc   |0.8607|±  |0.0245|
|  - us_foreign_policy                  |Yaml   |none  |     0|acc   |0.8900|±  |0.0314|
| - stem                                |N/A    |none  |     0|acc   |0.5078|±  |0.1252|
|  - abstract_algebra                   |Yaml   |none  |     0|acc   |0.3300|±  |0.0473|
|  - anatomy                            |Yaml   |none  |     0|acc   |0.5259|±  |0.0431|
|  - astronomy                          |Yaml   |none  |     0|acc   |0.7368|±  |0.0358|
|  - college_biology                    |Yaml   |none  |     0|acc   |0.7083|±  |0.0380|
|  - college_chemistry                  |Yaml   |none  |     0|acc   |0.4200|±  |0.0496|
|  - college_computer_science           |Yaml   |none  |     0|acc   |0.5500|±  |0.0500|
|  - college_mathematics                |Yaml   |none  |     0|acc   |0.3200|±  |0.0469|
|  - college_physics                    |Yaml   |none  |     0|acc   |0.3627|±  |0.0478|
|  - computer_security                  |Yaml   |none  |     0|acc   |0.6900|±  |0.0465|
|  - conceptual_physics                 |Yaml   |none  |     0|acc   |0.5191|±  |0.0327|
|  - electrical_engineering             |Yaml   |none  |     0|acc   |0.5241|±  |0.0416|
|  - elementary_mathematics             |Yaml   |none  |     0|acc   |0.3810|±  |0.0250|
|  - high_school_biology                |Yaml   |none  |     0|acc   |0.7645|±  |0.0241|
|  - high_school_chemistry              |Yaml   |none  |     0|acc   |0.4680|±  |0.0351|
|  - high_school_computer_science       |Yaml   |none  |     0|acc   |0.6300|±  |0.0485|
|  - high_school_mathematics            |Yaml   |none  |     0|acc   |0.3148|±  |0.0283|
|  - high_school_physics                |Yaml   |none  |     0|acc   |0.4371|±  |0.0405|
|  - high_school_statistics             |Yaml   |none  |     0|acc   |0.5000|±  |0.0341|
|  - machine_learning                   |Yaml   |none  |     0|acc   |0.4643|±  |0.0473|

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.6111|±  |0.1329|
| - humanities     |N/A    |none  |     0|acc   |0.5609|±  |0.1369|
| - other          |N/A    |none  |     0|acc   |0.6775|±  |0.1138|
| - social_sciences|N/A    |none  |     0|acc   |0.7267|±  |0.0780|
| - stem           |N/A    |none  |     0|acc   |0.5078|±  |0.1252|

70B-chat 5-shot, using parallelize=True

```bash
CUDA_VISIBLE_DEVICES="1,2,3,4,5,6,7" lm-eval --model hf \
    --model_args pretrained=/nas/lili/models_hf/70B-chat-hf,parallelize=True \
    --tasks mmlu \
    --device cuda \
    --batch_size 1 \
    --num_fewshot 5
```
hf (pretrained=/nas/lili/models_hf/70B-chat-hf,parallelize=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1
|                 Tasks                 |Version|Filter|n-shot|Metric|Value |   |Stderr|
|---------------------------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu                                   |N/A    |none  |     0|acc   |0.6320|±  |0.1239|
| - humanities                          |N/A    |none  |     5|acc   |0.5953|±  |0.1120|
|  - formal_logic                       |Yaml   |none  |     5|acc   |0.4048|±  |0.0439|
|  - high_school_european_history       |Yaml   |none  |     5|acc   |0.7939|±  |0.0316|
|  - high_school_us_history             |Yaml   |none  |     5|acc   |0.8480|±  |0.0252|
|  - high_school_world_history          |Yaml   |none  |     5|acc   |0.8439|±  |0.0236|
|  - international_law                  |Yaml   |none  |     5|acc   |0.8182|±  |0.0352|
|  - jurisprudence                      |Yaml   |none  |     5|acc   |0.8241|±  |0.0368|
|  - logical_fallacies                  |Yaml   |none  |     5|acc   |0.7607|±  |0.0335|
|  - moral_disputes                     |Yaml   |none  |     5|acc   |0.7139|±  |0.0243|
|  - moral_scenarios                    |Yaml   |none  |     5|acc   |0.4011|±  |0.0164|
|  - philosophy                         |Yaml   |none  |     5|acc   |0.7106|±  |0.0258|
|  - prehistory                         |Yaml   |none  |     5|acc   |0.7130|±  |0.0252|
|  - professional_law                   |Yaml   |none  |     5|acc   |0.4798|±  |0.0128|
|  - world_religions                    |Yaml   |none  |     5|acc   |0.8187|±  |0.0295|
| - other                               |N/A    |none  |     5|acc   |0.6904|±  |0.1118|
|  - business_ethics                    |Yaml   |none  |     5|acc   |0.6600|±  |0.0476|
|  - clinical_knowledge                 |Yaml   |none  |     5|acc   |0.6453|±  |0.0294|
|  - college_medicine                   |Yaml   |none  |     5|acc   |0.6069|±  |0.0372|
|  - global_facts                       |Yaml   |none  |     5|acc   |0.4200|±  |0.0496|
|  - human_aging                        |Yaml   |none  |     5|acc   |0.7265|±  |0.0299|
|  - management                         |Yaml   |none  |     5|acc   |0.8058|±  |0.0392|
|  - marketing                          |Yaml   |none  |     5|acc   |0.8803|±  |0.0213|
|  - medical_genetics                   |Yaml   |none  |     5|acc   |0.6500|±  |0.0479|
|  - miscellaneous                      |Yaml   |none  |     5|acc   |0.8250|±  |0.0136|
|  - nutrition                          |Yaml   |none  |     5|acc   |0.6993|±  |0.0263|
|  - professional_accounting            |Yaml   |none  |     5|acc   |0.5071|±  |0.0298|
|  - professional_medicine              |Yaml   |none  |     5|acc   |0.5772|±  |0.0300|
|  - virology                           |Yaml   |none  |     5|acc   |0.5120|±  |0.0389|
| - social_sciences                     |N/A    |none  |     5|acc   |0.7400|±  |0.0749|
|  - econometrics                       |Yaml   |none  |     5|acc   |0.4123|±  |0.0463|
|  - high_school_geography              |Yaml   |none  |     5|acc   |0.8131|±  |0.0278|
|  - high_school_government_and_politics|Yaml   |none  |     5|acc   |0.8912|±  |0.0225|
|  - high_school_macroeconomics         |Yaml   |none  |     5|acc   |0.6385|±  |0.0244|
|  - high_school_microeconomics         |Yaml   |none  |     5|acc   |0.6639|±  |0.0307|
|  - high_school_psychology             |Yaml   |none  |     5|acc   |0.8349|±  |0.0159|
|  - human_sexuality                    |Yaml   |none  |     5|acc   |0.7099|±  |0.0398|
|  - professional_psychology            |Yaml   |none  |     5|acc   |0.6732|±  |0.0190|
|  - public_relations                   |Yaml   |none  |     5|acc   |0.6909|±  |0.0443|
|  - security_studies                   |Yaml   |none  |     5|acc   |0.7878|±  |0.0262|
|  - sociology                          |Yaml   |none  |     5|acc   |0.8657|±  |0.0241|
|  - us_foreign_policy                  |Yaml   |none  |     5|acc   |0.8700|±  |0.0338|
| - stem                                |N/A    |none  |     5|acc   |0.5236|±  |0.1294|
|  - abstract_algebra                   |Yaml   |none  |     5|acc   |0.3600|±  |0.0482|
|  - anatomy                            |Yaml   |none  |     5|acc   |0.5185|±  |0.0432|
|  - astronomy                          |Yaml   |none  |     5|acc   |0.7368|±  |0.0358|
|  - college_biology                    |Yaml   |none  |     5|acc   |0.7569|±  |0.0359|
|  - college_chemistry                  |Yaml   |none  |     5|acc   |0.4800|±  |0.0502|
|  - college_computer_science           |Yaml   |none  |     5|acc   |0.5900|±  |0.0494|
|  - college_mathematics                |Yaml   |none  |     5|acc   |0.3400|±  |0.0476|
|  - college_physics                    |Yaml   |none  |     5|acc   |0.3333|±  |0.0469|
|  - computer_security                  |Yaml   |none  |     5|acc   |0.7100|±  |0.0456|
|  - conceptual_physics                 |Yaml   |none  |     5|acc   |0.5830|±  |0.0322|
|  - electrical_engineering             |Yaml   |none  |     5|acc   |0.5862|±  |0.0410|
|  - elementary_mathematics             |Yaml   |none  |     5|acc   |0.4127|±  |0.0254|
|  - high_school_biology                |Yaml   |none  |     5|acc   |0.7613|±  |0.0243|
|  - high_school_chemistry              |Yaml   |none  |     5|acc   |0.4680|±  |0.0351|
|  - high_school_computer_science       |Yaml   |none  |     5|acc   |0.6500|±  |0.0479|
|  - high_school_mathematics            |Yaml   |none  |     5|acc   |0.3037|±  |0.0280|
|  - high_school_physics                |Yaml   |none  |     5|acc   |0.4238|±  |0.0403|
|  - high_school_statistics             |Yaml   |none  |     5|acc   |0.4815|±  |0.0341|
|  - machine_learning                   |Yaml   |none  |     5|acc   |0.4821|±  |0.0474|

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.6320|±  |0.1239|
| - humanities     |N/A    |none  |     5|acc   |0.5953|±  |0.1120|
| - other          |N/A    |none  |     5|acc   |0.6904|±  |0.1118|
| - social_sciences|N/A    |none  |     5|acc   |0.7400|±  |0.0749|
| - stem           |N/A    |none  |     5|acc   |0.5236|±  |0.1294|

fancyerii commented 10 months ago

@StellaAthena what's your result of 70b chat llama 2?

StellaAthena commented 10 months ago

> @StellaAthena what's your result of 70b chat llama 2?

The entire point of this library is to make it so you don't need to ask this question. My result will be exactly the same as your result.
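
To make a run fully specified and trivially rerunnable, the harness can also be driven from Python. A minimal sketch against the v0.4 API (the checkpoint path is a placeholder, and `make_table`'s location may vary slightly across versions):

```python
import lm_eval
from lm_eval.utils import make_table

# Pin every evaluation knob explicitly; rerunning this reproduces the
# same numbers (up to hardware nondeterminism).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/path/to/70B-chat-hf,parallelize=True",  # placeholder path
    tasks=["mmlu"],
    num_fewshot=5,
    batch_size=1,
)
print(make_table(results))
```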

> I have tested Llama 2 13B and 70B on MMLU with the v0.4 harness. My 5-shot result for 70B is 0.632, which is not as good as the paper's result (0.68).

Most LLM papers have irreproducible evaluations. They use custom prompts, custom formatting, etc. that they don't report. The reason this library exists is to give a reproducible and transparent benchmark for model behavior. Meta can't even reproduce their own work... the LLaMA 2 paper reports different scores for LLaMA 1 than the LLaMA 1 paper does!

It doesn't really matter whether you can reproduce the numbers from other papers, though, as there is no "one true way" to do evaluations. Run the models you want to compare through the eval harness and you'll get comparable numbers.
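
Concretely, "comparable numbers" means holding the task, shot count, and other settings fixed and varying only the checkpoint. A minimal sketch (the paths are placeholders, and the results key follows the v0.4 "metric,filter" convention, which may differ in other versions):

```python
import lm_eval

# Same tasks, shots, and batch size for every model; only the checkpoint varies.
for path in ["/path/to/llama-2-13b-hf", "/path/to/llama-2-70b-hf"]:  # placeholders
    res = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={path}",
        tasks=["mmlu"],
        num_fewshot=5,
        batch_size=1,
    )
    print(path, res["results"]["mmlu"]["acc,none"])
```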