EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Llama 2 70b chat has a lower MMLU score than the paper reported #1213

Closed · fancyerii closed this issue 8 months ago

fancyerii commented 8 months ago

I have tested Llama 2 13b-chat and 70b-chat on MMLU with version 0.4.0 of the harness. My 5-shot result for 70b is 0.632, which is not as good as the result reported in the paper (0.68).
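The 0-shot runs below used the same invocation as the 5-shot command shown further down, just without --num_fewshot; for example, the 13b run was launched roughly like this (paths are local to my machine):

lm-eval --model hf --model_args pretrained=/nas/lili/models_hf/13B-chat --tasks mmlu --device cuda --batch_size 1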

13b-chat 0-shot

hf (pretrained=/nas/lili/models_hf/13B-chat), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---:|---|---:|---:|
| mmlu | N/A | none | 0 | acc | 0.5315 | ±0.1228 |
| - humanities | N/A | none | 0 | acc | 0.4978 | ±0.1175 |
| - formal_logic | Yaml | none | 0 | acc | 0.2381 | ±0.0381 |
| - high_school_european_history | Yaml | none | 0 | acc | 0.6667 | ±0.0368 |
| - high_school_us_history | Yaml | none | 0 | acc | 0.7304 | ±0.0311 |
| - high_school_world_history | Yaml | none | 0 | acc | 0.7215 | ±0.0292 |
| - international_law | Yaml | none | 0 | acc | 0.7190 | ±0.0410 |
| - jurisprudence | Yaml | none | 0 | acc | 0.6944 | ±0.0445 |
| - logical_fallacies | Yaml | none | 0 | acc | 0.6871 | ±0.0364 |
| - moral_disputes | Yaml | none | 0 | acc | 0.6012 | ±0.0264 |
| - moral_scenarios | Yaml | none | 0 | acc | 0.2816 | ±0.0150 |
| - philosophy | Yaml | none | 0 | acc | 0.6431 | ±0.0272 |
| - prehistory | Yaml | none | 0 | acc | 0.6235 | ±0.0270 |
| - professional_law | Yaml | none | 0 | acc | 0.4003 | ±0.0125 |
| - world_religions | Yaml | none | 0 | acc | 0.7719 | ±0.0322 |
| - other | N/A | none | 0 | acc | 0.6064 | ±0.1190 |
| - business_ethics | Yaml | none | 0 | acc | 0.5400 | ±0.0501 |
| - clinical_knowledge | Yaml | none | 0 | acc | 0.5887 | ±0.0303 |
| - college_medicine | Yaml | none | 0 | acc | 0.4162 | ±0.0376 |
| - global_facts | Yaml | none | 0 | acc | 0.3100 | ±0.0465 |
| - human_aging | Yaml | none | 0 | acc | 0.6278 | ±0.0324 |
| - management | Yaml | none | 0 | acc | 0.6893 | ±0.0458 |
| - marketing | Yaml | none | 0 | acc | 0.8034 | ±0.0260 |
| - medical_genetics | Yaml | none | 0 | acc | 0.5800 | ±0.0496 |
| - miscellaneous | Yaml | none | 0 | acc | 0.7676 | ±0.0151 |
| - nutrition | Yaml | none | 0 | acc | 0.6078 | ±0.0280 |
| - professional_accounting | Yaml | none | 0 | acc | 0.4078 | ±0.0293 |
| - professional_medicine | Yaml | none | 0 | acc | 0.4963 | ±0.0304 |
| - virology | Yaml | none | 0 | acc | 0.4639 | ±0.0388 |
| - social_sciences | N/A | none | 0 | acc | 0.6129 | ±0.0850 |
| - econometrics | Yaml | none | 0 | acc | 0.2544 | ±0.0410 |
| - high_school_geography | Yaml | none | 0 | acc | 0.6515 | ±0.0339 |
| - high_school_government_and_politics | Yaml | none | 0 | acc | 0.7617 | ±0.0307 |
| - high_school_macroeconomics | Yaml | none | 0 | acc | 0.5000 | ±0.0254 |
| - high_school_microeconomics | Yaml | none | 0 | acc | 0.5042 | ±0.0325 |
| - high_school_psychology | Yaml | none | 0 | acc | 0.7138 | ±0.0194 |
| - human_sexuality | Yaml | none | 0 | acc | 0.6412 | ±0.0421 |
| - professional_psychology | Yaml | none | 0 | acc | 0.5425 | ±0.0202 |
| - public_relations | Yaml | none | 0 | acc | 0.6273 | ±0.0463 |
| - security_studies | Yaml | none | 0 | acc | 0.6612 | ±0.0303 |
| - sociology | Yaml | none | 0 | acc | 0.7413 | ±0.0310 |
| - us_foreign_policy | Yaml | none | 0 | acc | 0.8100 | ±0.0394 |
| - stem | N/A | none | 0 | acc | 0.4285 | ±0.1137 |
| - abstract_algebra | Yaml | none | 0 | acc | 0.3100 | ±0.0465 |
| - anatomy | Yaml | none | 0 | acc | 0.5185 | ±0.0432 |
| - astronomy | Yaml | none | 0 | acc | 0.5789 | ±0.0402 |
| - college_biology | Yaml | none | 0 | acc | 0.5764 | ±0.0413 |
| - college_chemistry | Yaml | none | 0 | acc | 0.3400 | ±0.0476 |
| - college_computer_science | Yaml | none | 0 | acc | 0.4300 | ±0.0498 |
| - college_mathematics | Yaml | none | 0 | acc | 0.3000 | ±0.0461 |
| - college_physics | Yaml | none | 0 | acc | 0.2647 | ±0.0439 |
| - computer_security | Yaml | none | 0 | acc | 0.6700 | ±0.0473 |
| - conceptual_physics | Yaml | none | 0 | acc | 0.4170 | ±0.0322 |
| - electrical_engineering | Yaml | none | 0 | acc | 0.5448 | ±0.0415 |
| - elementary_mathematics | Yaml | none | 0 | acc | 0.3254 | ±0.0241 |
| - high_school_biology | Yaml | none | 0 | acc | 0.6323 | ±0.0274 |
| - high_school_chemistry | Yaml | none | 0 | acc | 0.4483 | ±0.0350 |
| - high_school_computer_science | Yaml | none | 0 | acc | 0.5500 | ±0.0500 |
| - high_school_mathematics | Yaml | none | 0 | acc | 0.2852 | ±0.0275 |
| - high_school_physics | Yaml | none | 0 | acc | 0.3245 | ±0.0382 |
| - high_school_statistics | Yaml | none | 0 | acc | 0.3333 | ±0.0321 |
| - machine_learning | Yaml | none | 0 | acc | 0.3393 | ±0.0449 |

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---:|---|---:|---:|
| mmlu | N/A | none | 0 | acc | 0.5315 | ±0.1228 |
| - humanities | N/A | none | 0 | acc | 0.4978 | ±0.1175 |
| - other | N/A | none | 0 | acc | 0.6064 | ±0.1190 |
| - social_sciences | N/A | none | 0 | acc | 0.6129 | ±0.0850 |
| - stem | N/A | none | 0 | acc | 0.4285 | ±0.1137 |

70b-chat 0-shot

hf (pretrained=/nas/lili/models_hf/70B-chat-hf,parallelize=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---:|---|---:|---:|
| mmlu | N/A | none | 0 | acc | 0.6111 | ±0.1329 |
| - humanities | N/A | none | 0 | acc | 0.5609 | ±0.1369 |
| - formal_logic | Yaml | none | 0 | acc | 0.3651 | ±0.0431 |
| - high_school_european_history | Yaml | none | 0 | acc | 0.8061 | ±0.0309 |
| - high_school_us_history | Yaml | none | 0 | acc | 0.8529 | ±0.0249 |
| - high_school_world_history | Yaml | none | 0 | acc | 0.8143 | ±0.0253 |
| - international_law | Yaml | none | 0 | acc | 0.7603 | ±0.0390 |
| - jurisprudence | Yaml | none | 0 | acc | 0.8148 | ±0.0376 |
| - logical_fallacies | Yaml | none | 0 | acc | 0.7730 | ±0.0329 |
| - moral_disputes | Yaml | none | 0 | acc | 0.7081 | ±0.0245 |
| - moral_scenarios | Yaml | none | 0 | acc | 0.2469 | ±0.0144 |
| - philosophy | Yaml | none | 0 | acc | 0.7106 | ±0.0258 |
| - prehistory | Yaml | none | 0 | acc | 0.6944 | ±0.0256 |
| - professional_law | Yaml | none | 0 | acc | 0.4778 | ±0.0128 |
| - world_religions | Yaml | none | 0 | acc | 0.8304 | ±0.0288 |
| - other | N/A | none | 0 | acc | 0.6775 | ±0.1138 |
| - business_ethics | Yaml | none | 0 | acc | 0.5700 | ±0.0498 |
| - clinical_knowledge | Yaml | none | 0 | acc | 0.6491 | ±0.0294 |
| - college_medicine | Yaml | none | 0 | acc | 0.6012 | ±0.0373 |
| - global_facts | Yaml | none | 0 | acc | 0.3800 | ±0.0488 |
| - human_aging | Yaml | none | 0 | acc | 0.6726 | ±0.0315 |
| - management | Yaml | none | 0 | acc | 0.8252 | ±0.0376 |
| - marketing | Yaml | none | 0 | acc | 0.8590 | ±0.0228 |
| - medical_genetics | Yaml | none | 0 | acc | 0.6200 | ±0.0488 |
| - miscellaneous | Yaml | none | 0 | acc | 0.8199 | ±0.0137 |
| - nutrition | Yaml | none | 0 | acc | 0.6928 | ±0.0264 |
| - professional_accounting | Yaml | none | 0 | acc | 0.4787 | ±0.0298 |
| - professional_medicine | Yaml | none | 0 | acc | 0.5993 | ±0.0298 |
| - virology | Yaml | none | 0 | acc | 0.5060 | ±0.0389 |
| - social_sciences | N/A | none | 0 | acc | 0.7267 | ±0.0780 |
| - econometrics | Yaml | none | 0 | acc | 0.3772 | ±0.0456 |
| - high_school_geography | Yaml | none | 0 | acc | 0.7626 | ±0.0303 |
| - high_school_government_and_politics | Yaml | none | 0 | acc | 0.8705 | ±0.0242 |
| - high_school_macroeconomics | Yaml | none | 0 | acc | 0.6359 | ±0.0244 |
| - high_school_microeconomics | Yaml | none | 0 | acc | 0.6513 | ±0.0310 |
| - high_school_psychology | Yaml | none | 0 | acc | 0.8349 | ±0.0159 |
| - human_sexuality | Yaml | none | 0 | acc | 0.7557 | ±0.0377 |
| - professional_psychology | Yaml | none | 0 | acc | 0.6634 | ±0.0191 |
| - public_relations | Yaml | none | 0 | acc | 0.7091 | ±0.0435 |
| - security_studies | Yaml | none | 0 | acc | 0.6980 | ±0.0294 |
| - sociology | Yaml | none | 0 | acc | 0.8607 | ±0.0245 |
| - us_foreign_policy | Yaml | none | 0 | acc | 0.8900 | ±0.0314 |
| - stem | N/A | none | 0 | acc | 0.5078 | ±0.1252 |
| - abstract_algebra | Yaml | none | 0 | acc | 0.3300 | ±0.0473 |
| - anatomy | Yaml | none | 0 | acc | 0.5259 | ±0.0431 |
| - astronomy | Yaml | none | 0 | acc | 0.7368 | ±0.0358 |
| - college_biology | Yaml | none | 0 | acc | 0.7083 | ±0.0380 |
| - college_chemistry | Yaml | none | 0 | acc | 0.4200 | ±0.0496 |
| - college_computer_science | Yaml | none | 0 | acc | 0.5500 | ±0.0500 |
| - college_mathematics | Yaml | none | 0 | acc | 0.3200 | ±0.0469 |
| - college_physics | Yaml | none | 0 | acc | 0.3627 | ±0.0478 |
| - computer_security | Yaml | none | 0 | acc | 0.6900 | ±0.0465 |
| - conceptual_physics | Yaml | none | 0 | acc | 0.5191 | ±0.0327 |
| - electrical_engineering | Yaml | none | 0 | acc | 0.5241 | ±0.0416 |
| - elementary_mathematics | Yaml | none | 0 | acc | 0.3810 | ±0.0250 |
| - high_school_biology | Yaml | none | 0 | acc | 0.7645 | ±0.0241 |
| - high_school_chemistry | Yaml | none | 0 | acc | 0.4680 | ±0.0351 |
| - high_school_computer_science | Yaml | none | 0 | acc | 0.6300 | ±0.0485 |
| - high_school_mathematics | Yaml | none | 0 | acc | 0.3148 | ±0.0283 |
| - high_school_physics | Yaml | none | 0 | acc | 0.4371 | ±0.0405 |
| - high_school_statistics | Yaml | none | 0 | acc | 0.5000 | ±0.0341 |
| - machine_learning | Yaml | none | 0 | acc | 0.4643 | ±0.0473 |

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---:|---|---:|---:|
| mmlu | N/A | none | 0 | acc | 0.6111 | ±0.1329 |
| - humanities | N/A | none | 0 | acc | 0.5609 | ±0.1369 |
| - other | N/A | none | 0 | acc | 0.6775 | ±0.1138 |
| - social_sciences | N/A | none | 0 | acc | 0.7267 | ±0.0780 |
| - stem | N/A | none | 0 | acc | 0.5078 | ±0.1252 |

70b-chat 5-shot, using parallelize=True

CUDA_VISIBLE_DEVICES="1,2,3,4,5,6,7" lm-eval --model hf --model_args pretrained=/nas/lili/models_hf/70B-chat-hf,parallelize=True --tasks mmlu --device cuda --batch_size 1 --num_fewshot 5

hf (pretrained=/nas/lili/models_hf/70B-chat-hf,parallelize=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---:|---|---:|---:|
| mmlu | N/A | none | 0 | acc | 0.6320 | ±0.1239 |
| - humanities | N/A | none | 5 | acc | 0.5953 | ±0.1120 |
| - formal_logic | Yaml | none | 5 | acc | 0.4048 | ±0.0439 |
| - high_school_european_history | Yaml | none | 5 | acc | 0.7939 | ±0.0316 |
| - high_school_us_history | Yaml | none | 5 | acc | 0.8480 | ±0.0252 |
| - high_school_world_history | Yaml | none | 5 | acc | 0.8439 | ±0.0236 |
| - international_law | Yaml | none | 5 | acc | 0.8182 | ±0.0352 |
| - jurisprudence | Yaml | none | 5 | acc | 0.8241 | ±0.0368 |
| - logical_fallacies | Yaml | none | 5 | acc | 0.7607 | ±0.0335 |
| - moral_disputes | Yaml | none | 5 | acc | 0.7139 | ±0.0243 |
| - moral_scenarios | Yaml | none | 5 | acc | 0.4011 | ±0.0164 |
| - philosophy | Yaml | none | 5 | acc | 0.7106 | ±0.0258 |
| - prehistory | Yaml | none | 5 | acc | 0.7130 | ±0.0252 |
| - professional_law | Yaml | none | 5 | acc | 0.4798 | ±0.0128 |
| - world_religions | Yaml | none | 5 | acc | 0.8187 | ±0.0295 |
| - other | N/A | none | 5 | acc | 0.6904 | ±0.1118 |
| - business_ethics | Yaml | none | 5 | acc | 0.6600 | ±0.0476 |
| - clinical_knowledge | Yaml | none | 5 | acc | 0.6453 | ±0.0294 |
| - college_medicine | Yaml | none | 5 | acc | 0.6069 | ±0.0372 |
| - global_facts | Yaml | none | 5 | acc | 0.4200 | ±0.0496 |
| - human_aging | Yaml | none | 5 | acc | 0.7265 | ±0.0299 |
| - management | Yaml | none | 5 | acc | 0.8058 | ±0.0392 |
| - marketing | Yaml | none | 5 | acc | 0.8803 | ±0.0213 |
| - medical_genetics | Yaml | none | 5 | acc | 0.6500 | ±0.0479 |
| - miscellaneous | Yaml | none | 5 | acc | 0.8250 | ±0.0136 |
| - nutrition | Yaml | none | 5 | acc | 0.6993 | ±0.0263 |
| - professional_accounting | Yaml | none | 5 | acc | 0.5071 | ±0.0298 |
| - professional_medicine | Yaml | none | 5 | acc | 0.5772 | ±0.0300 |
| - virology | Yaml | none | 5 | acc | 0.5120 | ±0.0389 |
| - social_sciences | N/A | none | 5 | acc | 0.7400 | ±0.0749 |
| - econometrics | Yaml | none | 5 | acc | 0.4123 | ±0.0463 |
| - high_school_geography | Yaml | none | 5 | acc | 0.8131 | ±0.0278 |
| - high_school_government_and_politics | Yaml | none | 5 | acc | 0.8912 | ±0.0225 |
| - high_school_macroeconomics | Yaml | none | 5 | acc | 0.6385 | ±0.0244 |
| - high_school_microeconomics | Yaml | none | 5 | acc | 0.6639 | ±0.0307 |
| - high_school_psychology | Yaml | none | 5 | acc | 0.8349 | ±0.0159 |
| - human_sexuality | Yaml | none | 5 | acc | 0.7099 | ±0.0398 |
| - professional_psychology | Yaml | none | 5 | acc | 0.6732 | ±0.0190 |
| - public_relations | Yaml | none | 5 | acc | 0.6909 | ±0.0443 |
| - security_studies | Yaml | none | 5 | acc | 0.7878 | ±0.0262 |
| - sociology | Yaml | none | 5 | acc | 0.8657 | ±0.0241 |
| - us_foreign_policy | Yaml | none | 5 | acc | 0.8700 | ±0.0338 |
| - stem | N/A | none | 5 | acc | 0.5236 | ±0.1294 |
| - abstract_algebra | Yaml | none | 5 | acc | 0.3600 | ±0.0482 |
| - anatomy | Yaml | none | 5 | acc | 0.5185 | ±0.0432 |
| - astronomy | Yaml | none | 5 | acc | 0.7368 | ±0.0358 |
| - college_biology | Yaml | none | 5 | acc | 0.7569 | ±0.0359 |
| - college_chemistry | Yaml | none | 5 | acc | 0.4800 | ±0.0502 |
| - college_computer_science | Yaml | none | 5 | acc | 0.5900 | ±0.0494 |
| - college_mathematics | Yaml | none | 5 | acc | 0.3400 | ±0.0476 |
| - college_physics | Yaml | none | 5 | acc | 0.3333 | ±0.0469 |
| - computer_security | Yaml | none | 5 | acc | 0.7100 | ±0.0456 |
| - conceptual_physics | Yaml | none | 5 | acc | 0.5830 | ±0.0322 |
| - electrical_engineering | Yaml | none | 5 | acc | 0.5862 | ±0.0410 |
| - elementary_mathematics | Yaml | none | 5 | acc | 0.4127 | ±0.0254 |
| - high_school_biology | Yaml | none | 5 | acc | 0.7613 | ±0.0243 |
| - high_school_chemistry | Yaml | none | 5 | acc | 0.4680 | ±0.0351 |
| - high_school_computer_science | Yaml | none | 5 | acc | 0.6500 | ±0.0479 |
| - high_school_mathematics | Yaml | none | 5 | acc | 0.3037 | ±0.0280 |
| - high_school_physics | Yaml | none | 5 | acc | 0.4238 | ±0.0403 |
| - high_school_statistics | Yaml | none | 5 | acc | 0.4815 | ±0.0341 |
| - machine_learning | Yaml | none | 5 | acc | 0.4821 | ±0.0474 |

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---:|---|---:|---:|
| mmlu | N/A | none | 0 | acc | 0.6320 | ±0.1239 |
| - humanities | N/A | none | 5 | acc | 0.5953 | ±0.1120 |
| - other | N/A | none | 5 | acc | 0.6904 | ±0.1118 |
| - social_sciences | N/A | none | 5 | acc | 0.7400 | ±0.0749 |
| - stem | N/A | none | 5 | acc | 0.5236 | ±0.1294 |
StellaAthena commented 8 months ago

The scores reported for LLaMA and LLaMA 2 are generally considered irreproducible because they were produced with custom, undisclosed prompts.
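For context, Llama-2-chat models were fine-tuned on Meta's dialogue template, whereas the harness scores MMLU with a plain completion-style prompt, so a chat model is queried in a format it was never tuned on. A minimal sketch of that template (the [INST]/<<SYS>> wrapper from Meta's llama repo; the MMLU query text below is a hypothetical harness-style prompt, and the exact wrapping Meta used for its reported numbers is not public):

```python
# Sketch only: Meta's Llama-2-chat dialogue template vs. the plain
# completion prompt the harness sends for MMLU. The prompt Meta used
# for its published MMLU scores is undisclosed.

def llama2_chat_wrap(user_msg: str, system_msg: str = "") -> str:
    """Wrap a message in the [INST]/<<SYS>> format Llama-2-chat was tuned on."""
    sys_block = f"<<SYS>>\n{system_msg}\n<</SYS>>\n\n" if system_msg else ""
    return f"[INST] {sys_block}{user_msg} [/INST]"

# Hypothetical MMLU-style query in the raw completion format:
plain_prompt = (
    "The following are multiple choice questions (with answers) about astronomy.\n\n"
    "Question text...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:"
)

# A chat model was tuned to see something like this instead:
print(llama2_chat_wrap(plain_prompt))
```

Mismatches like this, or a different few-shot layout, can plausibly move MMLU accuracy by several points, which is why exact reproduction of the paper's numbers is not expected.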

fancyerii commented 8 months ago

Thank you.