EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Weird evaluation result of MMLU #887

Closed Yuxin715d closed 11 months ago

Yuxin715d commented 1 year ago

Hi, team. I tried to use your implementation to compute MMLU scores for some models, but for some of them the results are weird. For example, for llama2-13b, the commands I use to test its few-shot and zero-shot scores are as follows:

```bash
model_path=llama2-hf/13B
output_path=${output_dir}/${model_path//\//_}.txt
python3 lm-evaluation-harness/main.py \
    --model hf-causal-experimental \
    --model_args pretrained=${model_path},dtype=float16 \
    --tasks ${tasks} \
    --num_fewshot 5 \
    --batch_size 4 \
    --device cuda:${gpu_index} \
    --no_cache \
    --output_path ${output_path}
```

```bash
python3 lm-evaluation-harness/main.py \
    --model hf-causal-experimental \
    --model_args pretrained=${model_path},dtype=float16 \
    --tasks ${tasks} \
    --batch_size 4 \
    --device cuda:${gpu_index} \
    --no_cache \
    --output_path ${output_path}
```

But I found that for some tasks in MMLU, such as hendrycksTest-astronomy and hendrycksTest-college_chemistry, the few-shot result is worse than the zero-shot result.

5-shot:

```json
"hendrycksTest-astronomy": {
    "acc": 0.5328947368421053,
    "acc_stderr": 0.040601270352363966,
    "acc_norm": 0.5328947368421053,
    "acc_norm_stderr": 0.040601270352363966
}
```

0-shot:

```json
"hendrycksTest-astronomy": {
    "acc": 0.5723684210526315,
    "acc_stderr": 0.04026097083296564,
    "acc_norm": 0.5723684210526315,
    "acc_norm_stderr": 0.04026097083296564
}
```

This phenomenon also shows up in other models, such as bloomz-1b7 (there it is even more obvious; zero-shot is much better). I am new to this area and not very experienced, but this doesn't seem normal to me. Can you give some comments or advice on this? Thanks.
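For reference, the gap between the two astronomy scores above is small relative to the reported standard errors. A minimal significance sketch in plain Python, using only the `acc` and `acc_stderr` values quoted above and treating the two runs as independent (an approximation, since they share the same questions):

```python
import math

# Accuracies and standard errors taken from the JSON snippets above.
acc_5shot, se_5shot = 0.5328947368421053, 0.040601270352363966
acc_0shot, se_0shot = 0.5723684210526315, 0.04026097083296564

# Two-sample z-statistic for the difference between the two accuracies.
diff = acc_0shot - acc_5shot
se_diff = math.sqrt(se_5shot**2 + se_0shot**2)
z = diff / se_diff

# Two-sided p-value from the normal approximation.
p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

print(f"diff={diff:.4f}, z={z:.2f}, p~{p:.2f}")  # z ~ 0.69, p ~ 0.49
```

A roughly 4-point swing on a ~150-question subtask is comfortably within sampling noise, so individual MMLU subtask scores can easily flip direction between 0-shot and 5-shot runs.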

StellaAthena commented 1 year ago

This happens sometimes. We've extensively tested MMLU and I'm pretty sure we have a near-exact replication of the LLaMA 2 paper's results.

Leaving this open until someone double checks that we replicated LLaMA 2's MMLU scores.

fancyerii commented 10 months ago

I have tested Llama 2 13B and 70B on MMLU with the v0.4 harness. My 5-shot result for 70B is 0.632, which is not as good as the paper's result (0.68).

13B-chat 0-shot

hf (pretrained=/nas/lili/models_hf/13B-chat), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|                 Tasks                 |Version|Filter|n-shot|Metric|Value |   |Stderr|
|---------------------------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu                                   |N/A    |none  |     0|acc   |0.5315|±  |0.1228|
| - humanities                          |N/A    |none  |     0|acc   |0.4978|±  |0.1175|
|  - formal_logic                       |Yaml   |none  |     0|acc   |0.2381|±  |0.0381|
|  - high_school_european_history       |Yaml   |none  |     0|acc   |0.6667|±  |0.0368|
|  - high_school_us_history             |Yaml   |none  |     0|acc   |0.7304|±  |0.0311|
|  - high_school_world_history          |Yaml   |none  |     0|acc   |0.7215|±  |0.0292|
|  - international_law                  |Yaml   |none  |     0|acc   |0.7190|±  |0.0410|
|  - jurisprudence                      |Yaml   |none  |     0|acc   |0.6944|±  |0.0445|
|  - logical_fallacies                  |Yaml   |none  |     0|acc   |0.6871|±  |0.0364|
|  - moral_disputes                     |Yaml   |none  |     0|acc   |0.6012|±  |0.0264|
|  - moral_scenarios                    |Yaml   |none  |     0|acc   |0.2816|±  |0.0150|
|  - philosophy                         |Yaml   |none  |     0|acc   |0.6431|±  |0.0272|
|  - prehistory                         |Yaml   |none  |     0|acc   |0.6235|±  |0.0270|
|  - professional_law                   |Yaml   |none  |     0|acc   |0.4003|±  |0.0125|
|  - world_religions                    |Yaml   |none  |     0|acc   |0.7719|±  |0.0322|
| - other                               |N/A    |none  |     0|acc   |0.6064|±  |0.1190|
|  - business_ethics                    |Yaml   |none  |     0|acc   |0.5400|±  |0.0501|
|  - clinical_knowledge                 |Yaml   |none  |     0|acc   |0.5887|±  |0.0303|
|  - college_medicine                   |Yaml   |none  |     0|acc   |0.4162|±  |0.0376|
|  - global_facts                       |Yaml   |none  |     0|acc   |0.3100|±  |0.0465|
|  - human_aging                        |Yaml   |none  |     0|acc   |0.6278|±  |0.0324|
|  - management                         |Yaml   |none  |     0|acc   |0.6893|±  |0.0458|
|  - marketing                          |Yaml   |none  |     0|acc   |0.8034|±  |0.0260|
|  - medical_genetics                   |Yaml   |none  |     0|acc   |0.5800|±  |0.0496|
|  - miscellaneous                      |Yaml   |none  |     0|acc   |0.7676|±  |0.0151|
|  - nutrition                          |Yaml   |none  |     0|acc   |0.6078|±  |0.0280|
|  - professional_accounting            |Yaml   |none  |     0|acc   |0.4078|±  |0.0293|
|  - professional_medicine              |Yaml   |none  |     0|acc   |0.4963|±  |0.0304|
|  - virology                           |Yaml   |none  |     0|acc   |0.4639|±  |0.0388|
| - social_sciences                     |N/A    |none  |     0|acc   |0.6129|±  |0.0850|
|  - econometrics                       |Yaml   |none  |     0|acc   |0.2544|±  |0.0410|
|  - high_school_geography              |Yaml   |none  |     0|acc   |0.6515|±  |0.0339|
|  - high_school_government_and_politics|Yaml   |none  |     0|acc   |0.7617|±  |0.0307|
|  - high_school_macroeconomics         |Yaml   |none  |     0|acc   |0.5000|±  |0.0254|
|  - high_school_microeconomics         |Yaml   |none  |     0|acc   |0.5042|±  |0.0325|
|  - high_school_psychology             |Yaml   |none  |     0|acc   |0.7138|±  |0.0194|
|  - human_sexuality                    |Yaml   |none  |     0|acc   |0.6412|±  |0.0421|
|  - professional_psychology            |Yaml   |none  |     0|acc   |0.5425|±  |0.0202|
|  - public_relations                   |Yaml   |none  |     0|acc   |0.6273|±  |0.0463|
|  - security_studies                   |Yaml   |none  |     0|acc   |0.6612|±  |0.0303|
|  - sociology                          |Yaml   |none  |     0|acc   |0.7413|±  |0.0310|
|  - us_foreign_policy                  |Yaml   |none  |     0|acc   |0.8100|±  |0.0394|
| - stem                                |N/A    |none  |     0|acc   |0.4285|±  |0.1137|
|  - abstract_algebra                   |Yaml   |none  |     0|acc   |0.3100|±  |0.0465|
|  - anatomy                            |Yaml   |none  |     0|acc   |0.5185|±  |0.0432|
|  - astronomy                          |Yaml   |none  |     0|acc   |0.5789|±  |0.0402|
|  - college_biology                    |Yaml   |none  |     0|acc   |0.5764|±  |0.0413|
|  - college_chemistry                  |Yaml   |none  |     0|acc   |0.3400|±  |0.0476|
|  - college_computer_science           |Yaml   |none  |     0|acc   |0.4300|±  |0.0498|
|  - college_mathematics                |Yaml   |none  |     0|acc   |0.3000|±  |0.0461|
|  - college_physics                    |Yaml   |none  |     0|acc   |0.2647|±  |0.0439|
|  - computer_security                  |Yaml   |none  |     0|acc   |0.6700|±  |0.0473|
|  - conceptual_physics                 |Yaml   |none  |     0|acc   |0.4170|±  |0.0322|
|  - electrical_engineering             |Yaml   |none  |     0|acc   |0.5448|±  |0.0415|
|  - elementary_mathematics             |Yaml   |none  |     0|acc   |0.3254|±  |0.0241|
|  - high_school_biology                |Yaml   |none  |     0|acc   |0.6323|±  |0.0274|
|  - high_school_chemistry              |Yaml   |none  |     0|acc   |0.4483|±  |0.0350|
|  - high_school_computer_science       |Yaml   |none  |     0|acc   |0.5500|±  |0.0500|
|  - high_school_mathematics            |Yaml   |none  |     0|acc   |0.2852|±  |0.0275|
|  - high_school_physics                |Yaml   |none  |     0|acc   |0.3245|±  |0.0382|
|  - high_school_statistics             |Yaml   |none  |     0|acc   |0.3333|±  |0.0321|
|  - machine_learning                   |Yaml   |none  |     0|acc   |0.3393|±  |0.0449|

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.5315|±  |0.1228|
| - humanities     |N/A    |none  |     0|acc   |0.4978|±  |0.1175|
| - other          |N/A    |none  |     0|acc   |0.6064|±  |0.1190|
| - social_sciences|N/A    |none  |     0|acc   |0.6129|±  |0.0850|
| - stem           |N/A    |none  |     0|acc   |0.4285|±  |0.1137|
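
A note on reading these tables: each subtask row has a Stderr of roughly 0.02–0.05, yet the mmlu group row reports ±0.12. With about 14k test questions in total, pure sampling error on the aggregate would be far smaller, so the large group figure is dominated by how much accuracy varies across subtasks. The aggregate accuracy itself is essentially a size-weighted mean of the per-subtask accuracies; a minimal sketch of that aggregation (the example values are illustrative, and the harness's exact pooling formula may differ):

```python
# Size-weighted mean of per-subtask accuracies.
# The (n_questions, accuracy) pairs are illustrative examples, not the
# full 57-subtask MMLU breakdown; the harness's exact pooling may differ.
def weighted_acc(subtasks):
    total_n = sum(n for n, _ in subtasks)
    return sum(n * acc for n, acc in subtasks) / total_n

example = [(152, 0.5789), (100, 0.3400), (1534, 0.4003)]
print(f"{weighted_acc(example):.4f}")  # pulled toward the largest subtask
```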

70B-chat 0-shot

hf (pretrained=/nas/lili/models_hf/70B-chat-hf,parallelize=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|                 Tasks                 |Version|Filter|n-shot|Metric|Value |   |Stderr|
|---------------------------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu                                   |N/A    |none  |     0|acc   |0.6111|±  |0.1329|
| - humanities                          |N/A    |none  |     0|acc   |0.5609|±  |0.1369|
|  - formal_logic                       |Yaml   |none  |     0|acc   |0.3651|±  |0.0431|
|  - high_school_european_history       |Yaml   |none  |     0|acc   |0.8061|±  |0.0309|
|  - high_school_us_history             |Yaml   |none  |     0|acc   |0.8529|±  |0.0249|
|  - high_school_world_history          |Yaml   |none  |     0|acc   |0.8143|±  |0.0253|
|  - international_law                  |Yaml   |none  |     0|acc   |0.7603|±  |0.0390|
|  - jurisprudence                      |Yaml   |none  |     0|acc   |0.8148|±  |0.0376|
|  - logical_fallacies                  |Yaml   |none  |     0|acc   |0.7730|±  |0.0329|
|  - moral_disputes                     |Yaml   |none  |     0|acc   |0.7081|±  |0.0245|
|  - moral_scenarios                    |Yaml   |none  |     0|acc   |0.2469|±  |0.0144|
|  - philosophy                         |Yaml   |none  |     0|acc   |0.7106|±  |0.0258|
|  - prehistory                         |Yaml   |none  |     0|acc   |0.6944|±  |0.0256|
|  - professional_law                   |Yaml   |none  |     0|acc   |0.4778|±  |0.0128|
|  - world_religions                    |Yaml   |none  |     0|acc   |0.8304|±  |0.0288|
| - other                               |N/A    |none  |     0|acc   |0.6775|±  |0.1138|
|  - business_ethics                    |Yaml   |none  |     0|acc   |0.5700|±  |0.0498|
|  - clinical_knowledge                 |Yaml   |none  |     0|acc   |0.6491|±  |0.0294|
|  - college_medicine                   |Yaml   |none  |     0|acc   |0.6012|±  |0.0373|
|  - global_facts                       |Yaml   |none  |     0|acc   |0.3800|±  |0.0488|
|  - human_aging                        |Yaml   |none  |     0|acc   |0.6726|±  |0.0315|
|  - management                         |Yaml   |none  |     0|acc   |0.8252|±  |0.0376|
|  - marketing                          |Yaml   |none  |     0|acc   |0.8590|±  |0.0228|
|  - medical_genetics                   |Yaml   |none  |     0|acc   |0.6200|±  |0.0488|
|  - miscellaneous                      |Yaml   |none  |     0|acc   |0.8199|±  |0.0137|
|  - nutrition                          |Yaml   |none  |     0|acc   |0.6928|±  |0.0264|
|  - professional_accounting            |Yaml   |none  |     0|acc   |0.4787|±  |0.0298|
|  - professional_medicine              |Yaml   |none  |     0|acc   |0.5993|±  |0.0298|
|  - virology                           |Yaml   |none  |     0|acc   |0.5060|±  |0.0389|
| - social_sciences                     |N/A    |none  |     0|acc   |0.7267|±  |0.0780|
|  - econometrics                       |Yaml   |none  |     0|acc   |0.3772|±  |0.0456|
|  - high_school_geography              |Yaml   |none  |     0|acc   |0.7626|±  |0.0303|
|  - high_school_government_and_politics|Yaml   |none  |     0|acc   |0.8705|±  |0.0242|
|  - high_school_macroeconomics         |Yaml   |none  |     0|acc   |0.6359|±  |0.0244|
|  - high_school_microeconomics         |Yaml   |none  |     0|acc   |0.6513|±  |0.0310|
|  - high_school_psychology             |Yaml   |none  |     0|acc   |0.8349|±  |0.0159|
|  - human_sexuality                    |Yaml   |none  |     0|acc   |0.7557|±  |0.0377|
|  - professional_psychology            |Yaml   |none  |     0|acc   |0.6634|±  |0.0191|
|  - public_relations                   |Yaml   |none  |     0|acc   |0.7091|±  |0.0435|
|  - security_studies                   |Yaml   |none  |     0|acc   |0.6980|±  |0.0294|
|  - sociology                          |Yaml   |none  |     0|acc   |0.8607|±  |0.0245|
|  - us_foreign_policy                  |Yaml   |none  |     0|acc   |0.8900|±  |0.0314|
| - stem                                |N/A    |none  |     0|acc   |0.5078|±  |0.1252|
|  - abstract_algebra                   |Yaml   |none  |     0|acc   |0.3300|±  |0.0473|
|  - anatomy                            |Yaml   |none  |     0|acc   |0.5259|±  |0.0431|
|  - astronomy                          |Yaml   |none  |     0|acc   |0.7368|±  |0.0358|
|  - college_biology                    |Yaml   |none  |     0|acc   |0.7083|±  |0.0380|
|  - college_chemistry                  |Yaml   |none  |     0|acc   |0.4200|±  |0.0496|
|  - college_computer_science           |Yaml   |none  |     0|acc   |0.5500|±  |0.0500|
|  - college_mathematics                |Yaml   |none  |     0|acc   |0.3200|±  |0.0469|
|  - college_physics                    |Yaml   |none  |     0|acc   |0.3627|±  |0.0478|
|  - computer_security                  |Yaml   |none  |     0|acc   |0.6900|±  |0.0465|
|  - conceptual_physics                 |Yaml   |none  |     0|acc   |0.5191|±  |0.0327|
|  - electrical_engineering             |Yaml   |none  |     0|acc   |0.5241|±  |0.0416|
|  - elementary_mathematics             |Yaml   |none  |     0|acc   |0.3810|±  |0.0250|
|  - high_school_biology                |Yaml   |none  |     0|acc   |0.7645|±  |0.0241|
|  - high_school_chemistry              |Yaml   |none  |     0|acc   |0.4680|±  |0.0351|
|  - high_school_computer_science       |Yaml   |none  |     0|acc   |0.6300|±  |0.0485|
|  - high_school_mathematics            |Yaml   |none  |     0|acc   |0.3148|±  |0.0283|
|  - high_school_physics                |Yaml   |none  |     0|acc   |0.4371|±  |0.0405|
|  - high_school_statistics             |Yaml   |none  |     0|acc   |0.5000|±  |0.0341|
|  - machine_learning                   |Yaml   |none  |     0|acc   |0.4643|±  |0.0473|

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.6111|±  |0.1329|
| - humanities     |N/A    |none  |     0|acc   |0.5609|±  |0.1369|
| - other          |N/A    |none  |     0|acc   |0.6775|±  |0.1138|
| - social_sciences|N/A    |none  |     0|acc   |0.7267|±  |0.0780|
| - stem           |N/A    |none  |     0|acc   |0.5078|±  |0.1252|

70B-chat 5-shot, using parallelize=True

```bash
CUDA_VISIBLE_DEVICES="1,2,3,4,5,6,7" lm-eval --model hf \
    --model_args pretrained=/nas/lili/models_hf/70B-chat-hf,parallelize=True \
    --tasks mmlu \
    --device cuda \
    --batch_size 1 \
    --num_fewshot 5
```
hf (pretrained=/nas/lili/models_hf/70B-chat-hf,parallelize=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1
|                 Tasks                 |Version|Filter|n-shot|Metric|Value |   |Stderr|
|---------------------------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu                                   |N/A    |none  |     0|acc   |0.6320|±  |0.1239|
| - humanities                          |N/A    |none  |     5|acc   |0.5953|±  |0.1120|
|  - formal_logic                       |Yaml   |none  |     5|acc   |0.4048|±  |0.0439|
|  - high_school_european_history       |Yaml   |none  |     5|acc   |0.7939|±  |0.0316|
|  - high_school_us_history             |Yaml   |none  |     5|acc   |0.8480|±  |0.0252|
|  - high_school_world_history          |Yaml   |none  |     5|acc   |0.8439|±  |0.0236|
|  - international_law                  |Yaml   |none  |     5|acc   |0.8182|±  |0.0352|
|  - jurisprudence                      |Yaml   |none  |     5|acc   |0.8241|±  |0.0368|
|  - logical_fallacies                  |Yaml   |none  |     5|acc   |0.7607|±  |0.0335|
|  - moral_disputes                     |Yaml   |none  |     5|acc   |0.7139|±  |0.0243|
|  - moral_scenarios                    |Yaml   |none  |     5|acc   |0.4011|±  |0.0164|
|  - philosophy                         |Yaml   |none  |     5|acc   |0.7106|±  |0.0258|
|  - prehistory                         |Yaml   |none  |     5|acc   |0.7130|±  |0.0252|
|  - professional_law                   |Yaml   |none  |     5|acc   |0.4798|±  |0.0128|
|  - world_religions                    |Yaml   |none  |     5|acc   |0.8187|±  |0.0295|
| - other                               |N/A    |none  |     5|acc   |0.6904|±  |0.1118|
|  - business_ethics                    |Yaml   |none  |     5|acc   |0.6600|±  |0.0476|
|  - clinical_knowledge                 |Yaml   |none  |     5|acc   |0.6453|±  |0.0294|
|  - college_medicine                   |Yaml   |none  |     5|acc   |0.6069|±  |0.0372|
|  - global_facts                       |Yaml   |none  |     5|acc   |0.4200|±  |0.0496|
|  - human_aging                        |Yaml   |none  |     5|acc   |0.7265|±  |0.0299|
|  - management                         |Yaml   |none  |     5|acc   |0.8058|±  |0.0392|
|  - marketing                          |Yaml   |none  |     5|acc   |0.8803|±  |0.0213|
|  - medical_genetics                   |Yaml   |none  |     5|acc   |0.6500|±  |0.0479|
|  - miscellaneous                      |Yaml   |none  |     5|acc   |0.8250|±  |0.0136|
|  - nutrition                          |Yaml   |none  |     5|acc   |0.6993|±  |0.0263|
|  - professional_accounting            |Yaml   |none  |     5|acc   |0.5071|±  |0.0298|
|  - professional_medicine              |Yaml   |none  |     5|acc   |0.5772|±  |0.0300|
|  - virology                           |Yaml   |none  |     5|acc   |0.5120|±  |0.0389|
| - social_sciences                     |N/A    |none  |     5|acc   |0.7400|±  |0.0749|
|  - econometrics                       |Yaml   |none  |     5|acc   |0.4123|±  |0.0463|
|  - high_school_geography              |Yaml   |none  |     5|acc   |0.8131|±  |0.0278|
|  - high_school_government_and_politics|Yaml   |none  |     5|acc   |0.8912|±  |0.0225|
|  - high_school_macroeconomics         |Yaml   |none  |     5|acc   |0.6385|±  |0.0244|
|  - high_school_microeconomics         |Yaml   |none  |     5|acc   |0.6639|±  |0.0307|
|  - high_school_psychology             |Yaml   |none  |     5|acc   |0.8349|±  |0.0159|
|  - human_sexuality                    |Yaml   |none  |     5|acc   |0.7099|±  |0.0398|
|  - professional_psychology            |Yaml   |none  |     5|acc   |0.6732|±  |0.0190|
|  - public_relations                   |Yaml   |none  |     5|acc   |0.6909|±  |0.0443|
|  - security_studies                   |Yaml   |none  |     5|acc   |0.7878|±  |0.0262|
|  - sociology                          |Yaml   |none  |     5|acc   |0.8657|±  |0.0241|
|  - us_foreign_policy                  |Yaml   |none  |     5|acc   |0.8700|±  |0.0338|
| - stem                                |N/A    |none  |     5|acc   |0.5236|±  |0.1294|
|  - abstract_algebra                   |Yaml   |none  |     5|acc   |0.3600|±  |0.0482|
|  - anatomy                            |Yaml   |none  |     5|acc   |0.5185|±  |0.0432|
|  - astronomy                          |Yaml   |none  |     5|acc   |0.7368|±  |0.0358|
|  - college_biology                    |Yaml   |none  |     5|acc   |0.7569|±  |0.0359|
|  - college_chemistry                  |Yaml   |none  |     5|acc   |0.4800|±  |0.0502|
|  - college_computer_science           |Yaml   |none  |     5|acc   |0.5900|±  |0.0494|
|  - college_mathematics                |Yaml   |none  |     5|acc   |0.3400|±  |0.0476|
|  - college_physics                    |Yaml   |none  |     5|acc   |0.3333|±  |0.0469|
|  - computer_security                  |Yaml   |none  |     5|acc   |0.7100|±  |0.0456|
|  - conceptual_physics                 |Yaml   |none  |     5|acc   |0.5830|±  |0.0322|
|  - electrical_engineering             |Yaml   |none  |     5|acc   |0.5862|±  |0.0410|
|  - elementary_mathematics             |Yaml   |none  |     5|acc   |0.4127|±  |0.0254|
|  - high_school_biology                |Yaml   |none  |     5|acc   |0.7613|±  |0.0243|
|  - high_school_chemistry              |Yaml   |none  |     5|acc   |0.4680|±  |0.0351|
|  - high_school_computer_science       |Yaml   |none  |     5|acc   |0.6500|±  |0.0479|
|  - high_school_mathematics            |Yaml   |none  |     5|acc   |0.3037|±  |0.0280|
|  - high_school_physics                |Yaml   |none  |     5|acc   |0.4238|±  |0.0403|
|  - high_school_statistics             |Yaml   |none  |     5|acc   |0.4815|±  |0.0341|
|  - machine_learning                   |Yaml   |none  |     5|acc   |0.4821|±  |0.0474|

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.6320|±  |0.1239|
| - humanities     |N/A    |none  |     5|acc   |0.5953|±  |0.1120|
| - other          |N/A    |none  |     5|acc   |0.6904|±  |0.1118|
| - social_sciences|N/A    |none  |     5|acc   |0.7400|±  |0.0749|
| - stem           |N/A    |none  |     5|acc   |0.5236|±  |0.1294|

fancyerii commented 10 months ago

@StellaAthena what's your result of 70b chat llama 2?

StellaAthena commented 10 months ago

> @StellaAthena what's your result of 70b chat llama 2?

The entire point of this library is to make it so you don't need to ask this question. My result will be exactly the same as your result.
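
To make a run fully specified and trivially rerunnable, the harness can also be driven from Python. A minimal sketch against the v0.4 API (the checkpoint path is a placeholder, and `make_table`'s location may vary slightly across versions):

```python
import lm_eval
from lm_eval.utils import make_table

# Pin every evaluation knob explicitly; rerunning this reproduces the
# same numbers (up to hardware nondeterminism).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/path/to/70B-chat-hf,parallelize=True",  # placeholder path
    tasks=["mmlu"],
    num_fewshot=5,
    batch_size=1,
)
print(make_table(results))
```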

> I have tested Llama 2 13B and 70B on MMLU with the v0.4 harness. My 5-shot result for 70B is 0.632, which is not as good as the paper's result (0.68).

Most LLM papers have irreproducible evaluations. They use custom prompts, custom formatting, etc. that they don't report. The reason this library exists is to give a reproducible and transparent benchmark for model behavior. Meta can't even reproduce their own work... the LLaMA 2 paper reports different scores for LLaMA 1 than the LLaMA 1 paper does!

It doesn't really matter whether you can reproduce the numbers from other papers, though, as there is no "one true way" to do evaluations. Run the models you want to compare through the eval harness and you'll get comparable numbers.
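
Concretely, "comparable numbers" means holding the task, shot count, and other settings fixed and varying only the checkpoint. A minimal sketch (the paths are placeholders, and the results key follows the v0.4 "metric,filter" convention, which may differ in other versions):

```python
import lm_eval

# Same tasks, shots, and batch size for every model; only the checkpoint varies.
for path in ["/path/to/llama-2-13b-hf", "/path/to/llama-2-70b-hf"]:  # placeholders
    res = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={path}",
        tasks=["mmlu"],
        num_fewshot=5,
        batch_size=1,
    )
    print(path, res["results"]["mmlu"]["acc,none"])
```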