EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Llama 2 70b chat has a lower MMLU score than the paper reported #1213

Closed · fancyerii closed this issue 8 months ago

fancyerii commented 8 months ago

I have tested Llama 2 13b-chat and 70b-chat on MMLU with version 0.4.0 of the harness. My 5-shot result for 70b is 0.632, which is not as good as the result reported in the paper (0.68).
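The 0-shot runs below used the same invocation as the 5-shot command shown further down, just without --num_fewshot; for example, the 13b run was launched roughly like this (paths are local to my machine):

lm-eval --model hf --model_args pretrained=/nas/lili/models_hf/13B-chat --tasks mmlu --device cuda --batch_size 1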

13b-chat 0-shot

hf (pretrained=/nas/lili/models_hf/13B-chat), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---:|---|---:|---:|
| mmlu | N/A | none | 0 | acc | 0.5315 | ±0.1228 |
| - humanities | N/A | none | 0 | acc | 0.4978 | ±0.1175 |
| - formal_logic | Yaml | none | 0 | acc | 0.2381 | ±0.0381 |
| - high_school_european_history | Yaml | none | 0 | acc | 0.6667 | ±0.0368 |
| - high_school_us_history | Yaml | none | 0 | acc | 0.7304 | ±0.0311 |
| - high_school_world_history | Yaml | none | 0 | acc | 0.7215 | ±0.0292 |
| - international_law | Yaml | none | 0 | acc | 0.7190 | ±0.0410 |
| - jurisprudence | Yaml | none | 0 | acc | 0.6944 | ±0.0445 |
| - logical_fallacies | Yaml | none | 0 | acc | 0.6871 | ±0.0364 |
| - moral_disputes | Yaml | none | 0 | acc | 0.6012 | ±0.0264 |
| - moral_scenarios | Yaml | none | 0 | acc | 0.2816 | ±0.0150 |
| - philosophy | Yaml | none | 0 | acc | 0.6431 | ±0.0272 |
| - prehistory | Yaml | none | 0 | acc | 0.6235 | ±0.0270 |
| - professional_law | Yaml | none | 0 | acc | 0.4003 | ±0.0125 |
| - world_religions | Yaml | none | 0 | acc | 0.7719 | ±0.0322 |
| - other | N/A | none | 0 | acc | 0.6064 | ±0.1190 |
| - business_ethics | Yaml | none | 0 | acc | 0.5400 | ±0.0501 |
| - clinical_knowledge | Yaml | none | 0 | acc | 0.5887 | ±0.0303 |
| - college_medicine | Yaml | none | 0 | acc | 0.4162 | ±0.0376 |
| - global_facts | Yaml | none | 0 | acc | 0.3100 | ±0.0465 |
| - human_aging | Yaml | none | 0 | acc | 0.6278 | ±0.0324 |
| - management | Yaml | none | 0 | acc | 0.6893 | ±0.0458 |
| - marketing | Yaml | none | 0 | acc | 0.8034 | ±0.0260 |
| - medical_genetics | Yaml | none | 0 | acc | 0.5800 | ±0.0496 |
| - miscellaneous | Yaml | none | 0 | acc | 0.7676 | ±0.0151 |
| - nutrition | Yaml | none | 0 | acc | 0.6078 | ±0.0280 |
| - professional_accounting | Yaml | none | 0 | acc | 0.4078 | ±0.0293 |
| - professional_medicine | Yaml | none | 0 | acc | 0.4963 | ±0.0304 |
| - virology | Yaml | none | 0 | acc | 0.4639 | ±0.0388 |
| - social_sciences | N/A | none | 0 | acc | 0.6129 | ±0.0850 |
| - econometrics | Yaml | none | 0 | acc | 0.2544 | ±0.0410 |
| - high_school_geography | Yaml | none | 0 | acc | 0.6515 | ±0.0339 |
| - high_school_government_and_politics | Yaml | none | 0 | acc | 0.7617 | ±0.0307 |
| - high_school_macroeconomics | Yaml | none | 0 | acc | 0.5000 | ±0.0254 |
| - high_school_microeconomics | Yaml | none | 0 | acc | 0.5042 | ±0.0325 |
| - high_school_psychology | Yaml | none | 0 | acc | 0.7138 | ±0.0194 |
| - human_sexuality | Yaml | none | 0 | acc | 0.6412 | ±0.0421 |
| - professional_psychology | Yaml | none | 0 | acc | 0.5425 | ±0.0202 |
| - public_relations | Yaml | none | 0 | acc | 0.6273 | ±0.0463 |
| - security_studies | Yaml | none | 0 | acc | 0.6612 | ±0.0303 |
| - sociology | Yaml | none | 0 | acc | 0.7413 | ±0.0310 |
| - us_foreign_policy | Yaml | none | 0 | acc | 0.8100 | ±0.0394 |
| - stem | N/A | none | 0 | acc | 0.4285 | ±0.1137 |
| - abstract_algebra | Yaml | none | 0 | acc | 0.3100 | ±0.0465 |
| - anatomy | Yaml | none | 0 | acc | 0.5185 | ±0.0432 |
| - astronomy | Yaml | none | 0 | acc | 0.5789 | ±0.0402 |
| - college_biology | Yaml | none | 0 | acc | 0.5764 | ±0.0413 |
| - college_chemistry | Yaml | none | 0 | acc | 0.3400 | ±0.0476 |
| - college_computer_science | Yaml | none | 0 | acc | 0.4300 | ±0.0498 |
| - college_mathematics | Yaml | none | 0 | acc | 0.3000 | ±0.0461 |
| - college_physics | Yaml | none | 0 | acc | 0.2647 | ±0.0439 |
| - computer_security | Yaml | none | 0 | acc | 0.6700 | ±0.0473 |
| - conceptual_physics | Yaml | none | 0 | acc | 0.4170 | ±0.0322 |
| - electrical_engineering | Yaml | none | 0 | acc | 0.5448 | ±0.0415 |
| - elementary_mathematics | Yaml | none | 0 | acc | 0.3254 | ±0.0241 |
| - high_school_biology | Yaml | none | 0 | acc | 0.6323 | ±0.0274 |
| - high_school_chemistry | Yaml | none | 0 | acc | 0.4483 | ±0.0350 |
| - high_school_computer_science | Yaml | none | 0 | acc | 0.5500 | ±0.0500 |
| - high_school_mathematics | Yaml | none | 0 | acc | 0.2852 | ±0.0275 |
| - high_school_physics | Yaml | none | 0 | acc | 0.3245 | ±0.0382 |
| - high_school_statistics | Yaml | none | 0 | acc | 0.3333 | ±0.0321 |
| - machine_learning | Yaml | none | 0 | acc | 0.3393 | ±0.0449 |

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---:|---|---:|---:|
| mmlu | N/A | none | 0 | acc | 0.5315 | ±0.1228 |
| - humanities | N/A | none | 0 | acc | 0.4978 | ±0.1175 |
| - other | N/A | none | 0 | acc | 0.6064 | ±0.1190 |
| - social_sciences | N/A | none | 0 | acc | 0.6129 | ±0.0850 |
| - stem | N/A | none | 0 | acc | 0.4285 | ±0.1137 |

70b-chat 0-shot

hf (pretrained=/nas/lili/models_hf/70B-chat-hf,parallelize=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---:|---|---:|---:|
| mmlu | N/A | none | 0 | acc | 0.6111 | ±0.1329 |
| - humanities | N/A | none | 0 | acc | 0.5609 | ±0.1369 |
| - formal_logic | Yaml | none | 0 | acc | 0.3651 | ±0.0431 |
| - high_school_european_history | Yaml | none | 0 | acc | 0.8061 | ±0.0309 |
| - high_school_us_history | Yaml | none | 0 | acc | 0.8529 | ±0.0249 |
| - high_school_world_history | Yaml | none | 0 | acc | 0.8143 | ±0.0253 |
| - international_law | Yaml | none | 0 | acc | 0.7603 | ±0.0390 |
| - jurisprudence | Yaml | none | 0 | acc | 0.8148 | ±0.0376 |
| - logical_fallacies | Yaml | none | 0 | acc | 0.7730 | ±0.0329 |
| - moral_disputes | Yaml | none | 0 | acc | 0.7081 | ±0.0245 |
| - moral_scenarios | Yaml | none | 0 | acc | 0.2469 | ±0.0144 |
| - philosophy | Yaml | none | 0 | acc | 0.7106 | ±0.0258 |
| - prehistory | Yaml | none | 0 | acc | 0.6944 | ±0.0256 |
| - professional_law | Yaml | none | 0 | acc | 0.4778 | ±0.0128 |
| - world_religions | Yaml | none | 0 | acc | 0.8304 | ±0.0288 |
| - other | N/A | none | 0 | acc | 0.6775 | ±0.1138 |
| - business_ethics | Yaml | none | 0 | acc | 0.5700 | ±0.0498 |
| - clinical_knowledge | Yaml | none | 0 | acc | 0.6491 | ±0.0294 |
| - college_medicine | Yaml | none | 0 | acc | 0.6012 | ±0.0373 |
| - global_facts | Yaml | none | 0 | acc | 0.3800 | ±0.0488 |
| - human_aging | Yaml | none | 0 | acc | 0.6726 | ±0.0315 |
| - management | Yaml | none | 0 | acc | 0.8252 | ±0.0376 |
| - marketing | Yaml | none | 0 | acc | 0.8590 | ±0.0228 |
| - medical_genetics | Yaml | none | 0 | acc | 0.6200 | ±0.0488 |
| - miscellaneous | Yaml | none | 0 | acc | 0.8199 | ±0.0137 |
| - nutrition | Yaml | none | 0 | acc | 0.6928 | ±0.0264 |
| - professional_accounting | Yaml | none | 0 | acc | 0.4787 | ±0.0298 |
| - professional_medicine | Yaml | none | 0 | acc | 0.5993 | ±0.0298 |
| - virology | Yaml | none | 0 | acc | 0.5060 | ±0.0389 |
| - social_sciences | N/A | none | 0 | acc | 0.7267 | ±0.0780 |
| - econometrics | Yaml | none | 0 | acc | 0.3772 | ±0.0456 |
| - high_school_geography | Yaml | none | 0 | acc | 0.7626 | ±0.0303 |
| - high_school_government_and_politics | Yaml | none | 0 | acc | 0.8705 | ±0.0242 |
| - high_school_macroeconomics | Yaml | none | 0 | acc | 0.6359 | ±0.0244 |
| - high_school_microeconomics | Yaml | none | 0 | acc | 0.6513 | ±0.0310 |
| - high_school_psychology | Yaml | none | 0 | acc | 0.8349 | ±0.0159 |
| - human_sexuality | Yaml | none | 0 | acc | 0.7557 | ±0.0377 |
| - professional_psychology | Yaml | none | 0 | acc | 0.6634 | ±0.0191 |
| - public_relations | Yaml | none | 0 | acc | 0.7091 | ±0.0435 |
| - security_studies | Yaml | none | 0 | acc | 0.6980 | ±0.0294 |
| - sociology | Yaml | none | 0 | acc | 0.8607 | ±0.0245 |
| - us_foreign_policy | Yaml | none | 0 | acc | 0.8900 | ±0.0314 |
| - stem | N/A | none | 0 | acc | 0.5078 | ±0.1252 |
| - abstract_algebra | Yaml | none | 0 | acc | 0.3300 | ±0.0473 |
| - anatomy | Yaml | none | 0 | acc | 0.5259 | ±0.0431 |
| - astronomy | Yaml | none | 0 | acc | 0.7368 | ±0.0358 |
| - college_biology | Yaml | none | 0 | acc | 0.7083 | ±0.0380 |
| - college_chemistry | Yaml | none | 0 | acc | 0.4200 | ±0.0496 |
| - college_computer_science | Yaml | none | 0 | acc | 0.5500 | ±0.0500 |
| - college_mathematics | Yaml | none | 0 | acc | 0.3200 | ±0.0469 |
| - college_physics | Yaml | none | 0 | acc | 0.3627 | ±0.0478 |
| - computer_security | Yaml | none | 0 | acc | 0.6900 | ±0.0465 |
| - conceptual_physics | Yaml | none | 0 | acc | 0.5191 | ±0.0327 |
| - electrical_engineering | Yaml | none | 0 | acc | 0.5241 | ±0.0416 |
| - elementary_mathematics | Yaml | none | 0 | acc | 0.3810 | ±0.0250 |
| - high_school_biology | Yaml | none | 0 | acc | 0.7645 | ±0.0241 |
| - high_school_chemistry | Yaml | none | 0 | acc | 0.4680 | ±0.0351 |
| - high_school_computer_science | Yaml | none | 0 | acc | 0.6300 | ±0.0485 |
| - high_school_mathematics | Yaml | none | 0 | acc | 0.3148 | ±0.0283 |
| - high_school_physics | Yaml | none | 0 | acc | 0.4371 | ±0.0405 |
| - high_school_statistics | Yaml | none | 0 | acc | 0.5000 | ±0.0341 |
| - machine_learning | Yaml | none | 0 | acc | 0.4643 | ±0.0473 |

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---:|---|---:|---:|
| mmlu | N/A | none | 0 | acc | 0.6111 | ±0.1329 |
| - humanities | N/A | none | 0 | acc | 0.5609 | ±0.1369 |
| - other | N/A | none | 0 | acc | 0.6775 | ±0.1138 |
| - social_sciences | N/A | none | 0 | acc | 0.7267 | ±0.0780 |
| - stem | N/A | none | 0 | acc | 0.5078 | ±0.1252 |

70b-chat 5-shot, using parallelize=True

CUDA_VISIBLE_DEVICES="1,2,3,4,5,6,7" lm-eval --model hf --model_args pretrained=/nas/lili/models_hf/70B-chat-hf,parallelize=True --tasks mmlu --device cuda --batch_size 1 --num_fewshot 5

hf (pretrained=/nas/lili/models_hf/70B-chat-hf,parallelize=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---:|---|---:|---:|
| mmlu | N/A | none | 0 | acc | 0.6320 | ±0.1239 |
| - humanities | N/A | none | 5 | acc | 0.5953 | ±0.1120 |
| - formal_logic | Yaml | none | 5 | acc | 0.4048 | ±0.0439 |
| - high_school_european_history | Yaml | none | 5 | acc | 0.7939 | ±0.0316 |
| - high_school_us_history | Yaml | none | 5 | acc | 0.8480 | ±0.0252 |
| - high_school_world_history | Yaml | none | 5 | acc | 0.8439 | ±0.0236 |
| - international_law | Yaml | none | 5 | acc | 0.8182 | ±0.0352 |
| - jurisprudence | Yaml | none | 5 | acc | 0.8241 | ±0.0368 |
| - logical_fallacies | Yaml | none | 5 | acc | 0.7607 | ±0.0335 |
| - moral_disputes | Yaml | none | 5 | acc | 0.7139 | ±0.0243 |
| - moral_scenarios | Yaml | none | 5 | acc | 0.4011 | ±0.0164 |
| - philosophy | Yaml | none | 5 | acc | 0.7106 | ±0.0258 |
| - prehistory | Yaml | none | 5 | acc | 0.7130 | ±0.0252 |
| - professional_law | Yaml | none | 5 | acc | 0.4798 | ±0.0128 |
| - world_religions | Yaml | none | 5 | acc | 0.8187 | ±0.0295 |
| - other | N/A | none | 5 | acc | 0.6904 | ±0.1118 |
| - business_ethics | Yaml | none | 5 | acc | 0.6600 | ±0.0476 |
| - clinical_knowledge | Yaml | none | 5 | acc | 0.6453 | ±0.0294 |
| - college_medicine | Yaml | none | 5 | acc | 0.6069 | ±0.0372 |
| - global_facts | Yaml | none | 5 | acc | 0.4200 | ±0.0496 |
| - human_aging | Yaml | none | 5 | acc | 0.7265 | ±0.0299 |
| - management | Yaml | none | 5 | acc | 0.8058 | ±0.0392 |
| - marketing | Yaml | none | 5 | acc | 0.8803 | ±0.0213 |
| - medical_genetics | Yaml | none | 5 | acc | 0.6500 | ±0.0479 |
| - miscellaneous | Yaml | none | 5 | acc | 0.8250 | ±0.0136 |
| - nutrition | Yaml | none | 5 | acc | 0.6993 | ±0.0263 |
| - professional_accounting | Yaml | none | 5 | acc | 0.5071 | ±0.0298 |
| - professional_medicine | Yaml | none | 5 | acc | 0.5772 | ±0.0300 |
| - virology | Yaml | none | 5 | acc | 0.5120 | ±0.0389 |
| - social_sciences | N/A | none | 5 | acc | 0.7400 | ±0.0749 |
| - econometrics | Yaml | none | 5 | acc | 0.4123 | ±0.0463 |
| - high_school_geography | Yaml | none | 5 | acc | 0.8131 | ±0.0278 |
| - high_school_government_and_politics | Yaml | none | 5 | acc | 0.8912 | ±0.0225 |
| - high_school_macroeconomics | Yaml | none | 5 | acc | 0.6385 | ±0.0244 |
| - high_school_microeconomics | Yaml | none | 5 | acc | 0.6639 | ±0.0307 |
| - high_school_psychology | Yaml | none | 5 | acc | 0.8349 | ±0.0159 |
| - human_sexuality | Yaml | none | 5 | acc | 0.7099 | ±0.0398 |
| - professional_psychology | Yaml | none | 5 | acc | 0.6732 | ±0.0190 |
| - public_relations | Yaml | none | 5 | acc | 0.6909 | ±0.0443 |
| - security_studies | Yaml | none | 5 | acc | 0.7878 | ±0.0262 |
| - sociology | Yaml | none | 5 | acc | 0.8657 | ±0.0241 |
| - us_foreign_policy | Yaml | none | 5 | acc | 0.8700 | ±0.0338 |
| - stem | N/A | none | 5 | acc | 0.5236 | ±0.1294 |
| - abstract_algebra | Yaml | none | 5 | acc | 0.3600 | ±0.0482 |
| - anatomy | Yaml | none | 5 | acc | 0.5185 | ±0.0432 |
| - astronomy | Yaml | none | 5 | acc | 0.7368 | ±0.0358 |
| - college_biology | Yaml | none | 5 | acc | 0.7569 | ±0.0359 |
| - college_chemistry | Yaml | none | 5 | acc | 0.4800 | ±0.0502 |
| - college_computer_science | Yaml | none | 5 | acc | 0.5900 | ±0.0494 |
| - college_mathematics | Yaml | none | 5 | acc | 0.3400 | ±0.0476 |
| - college_physics | Yaml | none | 5 | acc | 0.3333 | ±0.0469 |
| - computer_security | Yaml | none | 5 | acc | 0.7100 | ±0.0456 |
| - conceptual_physics | Yaml | none | 5 | acc | 0.5830 | ±0.0322 |
| - electrical_engineering | Yaml | none | 5 | acc | 0.5862 | ±0.0410 |
| - elementary_mathematics | Yaml | none | 5 | acc | 0.4127 | ±0.0254 |
| - high_school_biology | Yaml | none | 5 | acc | 0.7613 | ±0.0243 |
| - high_school_chemistry | Yaml | none | 5 | acc | 0.4680 | ±0.0351 |
| - high_school_computer_science | Yaml | none | 5 | acc | 0.6500 | ±0.0479 |
| - high_school_mathematics | Yaml | none | 5 | acc | 0.3037 | ±0.0280 |
| - high_school_physics | Yaml | none | 5 | acc | 0.4238 | ±0.0403 |
| - high_school_statistics | Yaml | none | 5 | acc | 0.4815 | ±0.0341 |
| - machine_learning | Yaml | none | 5 | acc | 0.4821 | ±0.0474 |

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---:|---|---:|---:|
| mmlu | N/A | none | 0 | acc | 0.6320 | ±0.1239 |
| - humanities | N/A | none | 5 | acc | 0.5953 | ±0.1120 |
| - other | N/A | none | 5 | acc | 0.6904 | ±0.1118 |
| - social_sciences | N/A | none | 5 | acc | 0.7400 | ±0.0749 |
| - stem | N/A | none | 5 | acc | 0.5236 | ±0.1294 |
StellaAthena commented 8 months ago

The scores reported for LLaMA and LLaMA 2 are generally considered irreproducible because they were produced with custom, undisclosed prompts.
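For context, Llama-2-chat models were fine-tuned on Meta's dialogue template, whereas the harness scores MMLU with a plain completion-style prompt, so a chat model is queried in a format it was never tuned on. A minimal sketch of that template (the [INST]/<<SYS>> wrapper from Meta's llama repo; the MMLU query text below is a hypothetical harness-style prompt, and the exact wrapping Meta used for its reported numbers is not public):

```python
# Sketch only: Meta's Llama-2-chat dialogue template vs. the plain
# completion prompt the harness sends for MMLU. The prompt Meta used
# for its published MMLU scores is undisclosed.

def llama2_chat_wrap(user_msg: str, system_msg: str = "") -> str:
    """Wrap a message in the [INST]/<<SYS>> format Llama-2-chat was tuned on."""
    sys_block = f"<<SYS>>\n{system_msg}\n<</SYS>>\n\n" if system_msg else ""
    return f"[INST] {sys_block}{user_msg} [/INST]"

# Hypothetical MMLU-style query in the raw completion format:
plain_prompt = (
    "The following are multiple choice questions (with answers) about astronomy.\n\n"
    "Question text...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:"
)

# A chat model was tuned to see something like this instead:
print(llama2_chat_wrap(plain_prompt))
```

Mismatches like this, or a different few-shot layout, can plausibly move MMLU accuracy by several points, which is why exact reproduction of the paper's numbers is not expected.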

fancyerii commented 8 months ago

Thank you.