EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

The score of Qwen/Qwen1.5-7B on MMLU is 1.5 points lower than the score reported on the leaderboard #1525

Closed: shellhue closed this 6 months ago

shellhue commented 6 months ago

I cloned the lm-evaluation-harness repo from main and followed the instructions to install it. Then I evaluated the model Qwen/Qwen1.5-7B on MMLU with the command below. The output MMLU score is 60.43, but the MMLU score reported on the open_llm_leaderboard is 61.97. Is the difference normal, or am I evaluating in a wrong way?

repo commit: b177c82c

The command I used to evaluate:

lm_eval --model hf \
    --model_args pretrained=Qwen/Qwen1.5-7B \
    --tasks mmlu \
    --device cuda:0 \
    --batch_size 8 \
    --num_fewshot 5
baberabb commented 6 months ago

Seems about right. Have you tried with --batch_size 1?

shellhue commented 6 months ago

After setting --batch_size 1, I get 60.48; no big difference.

baberabb commented 6 months ago

Hmm. You could also check on the OpenLLM branch (https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463) as well, just to be sure. The command has changed a bit, but it's in the README, and MMLU can be called with --tasks hendrycksTest*.

haileyschoelkopf commented 6 months ago

@clefourrier how have you all been aggregating the MMLU subtasks' scores for the Open LLM Leaderboard?

We switched to weighting MMLU scores by the number of docs per subtask, as in https://github.com/hendrycks/test/blob/4450500f923c49f1fb1dd3d99108a0bd9717b660/evaluate.py#L82-L99 -- (we hadn't previously been reporting an averaged MMLU score in v0.3.0!)
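
For illustration, here is a minimal sketch of the two aggregation schemes under discussion: an unweighted mean over subtask accuracies versus a mean weighted by the number of documents per subtask, in the spirit of the linked evaluate.py. The accuracies and document counts below are placeholder values, not output from the harness.

# Sketch: unweighted vs. doc-count-weighted aggregation of MMLU subtask accuracies.
# All numbers here are illustrative placeholders.
subtask_acc = {
    "abstract_algebra": 0.3600,
    "moral_scenarios": 0.3151,
    "professional_law": 0.4381,
}
subtask_docs = {  # assumed per-subtask document counts
    "abstract_algebra": 100,
    "moral_scenarios": 895,
    "professional_law": 1534,
}

# Unweighted: every subtask counts equally, regardless of its size.
unweighted = sum(subtask_acc.values()) / len(subtask_acc)

# Weighted: each subtask contributes in proportion to its document count,
# equivalent to pooling all questions and computing a single accuracy.
total_docs = sum(subtask_docs.values())
weighted = sum(subtask_acc[t] * subtask_docs[t] for t in subtask_acc) / total_docs

print(f"unweighted mean acc: {unweighted:.4f}")
print(f"doc-weighted mean acc: {weighted:.4f}")

Because subtask sizes are very uneven, the two schemes generally give slightly different aggregate scores, which is why reporting both can be useful.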

baberabb commented 6 months ago

(average of all the results acc)

from here.

Looks like they do an unweighted average. Might be worth reporting both weighted and unweighted for MMLU (at least for the group as a whole).

shellhue commented 6 months ago

Hmm. You could also check on the OpenLLM branch (https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463) as well, just to be sure. The command has changed a bit, but it's in the README, and MMLU can be called with --tasks hendrycksTest*.

I tried and got:

Task Version Metric Value Stderr
hendrycksTest-abstract_algebra 1 acc 0.3600 ± 0.0482
acc_norm 0.3600 ± 0.0482
hendrycksTest-anatomy 1 acc 0.5185 ± 0.0432
acc_norm 0.5185 ± 0.0432
hendrycksTest-astronomy 1 acc 0.6842 ± 0.0378
acc_norm 0.6842 ± 0.0378
hendrycksTest-business_ethics 1 acc 0.6600 ± 0.0476
acc_norm 0.6600 ± 0.0476
hendrycksTest-clinical_knowledge 1 acc 0.6792 ± 0.0287
acc_norm 0.6792 ± 0.0287
hendrycksTest-college_biology 1 acc 0.6736 ± 0.0392
acc_norm 0.6736 ± 0.0392
hendrycksTest-college_chemistry 1 acc 0.4400 ± 0.0499
acc_norm 0.4400 ± 0.0499
hendrycksTest-college_computer_science 1 acc 0.5800 ± 0.0496
acc_norm 0.5800 ± 0.0496
hendrycksTest-college_mathematics 1 acc 0.3500 ± 0.0479
acc_norm 0.3500 ± 0.0479
hendrycksTest-college_medicine 1 acc 0.6243 ± 0.0369
acc_norm 0.6243 ± 0.0369
hendrycksTest-college_physics 1 acc 0.4020 ± 0.0488
acc_norm 0.4020 ± 0.0488
hendrycksTest-computer_security 1 acc 0.7200 ± 0.0451
acc_norm 0.7200 ± 0.0451
hendrycksTest-conceptual_physics 1 acc 0.5574 ± 0.0325
acc_norm 0.5574 ± 0.0325
hendrycksTest-econometrics 1 acc 0.4561 ± 0.0469
acc_norm 0.4561 ± 0.0469
hendrycksTest-electrical_engineering 1 acc 0.5586 ± 0.0414
acc_norm 0.5586 ± 0.0414
hendrycksTest-elementary_mathematics 1 acc 0.4683 ± 0.0257
acc_norm 0.4683 ± 0.0257
hendrycksTest-formal_logic 1 acc 0.4048 ± 0.0439
acc_norm 0.4048 ± 0.0439
hendrycksTest-global_facts 1 acc 0.3700 ± 0.0485
acc_norm 0.3700 ± 0.0485
hendrycksTest-high_school_biology 1 acc 0.7161 ± 0.0256
acc_norm 0.7161 ± 0.0256
hendrycksTest-high_school_chemistry 1 acc 0.5320 ± 0.0351
acc_norm 0.5320 ± 0.0351
hendrycksTest-high_school_computer_science 1 acc 0.7200 ± 0.0451
acc_norm 0.7200 ± 0.0451
hendrycksTest-high_school_european_history 1 acc 0.7091 ± 0.0355
acc_norm 0.7091 ± 0.0355
hendrycksTest-high_school_geography 1 acc 0.7828 ± 0.0294
acc_norm 0.7828 ± 0.0294
hendrycksTest-high_school_government_and_politics 1 acc 0.8083 ± 0.0284
acc_norm 0.8083 ± 0.0284
hendrycksTest-high_school_macroeconomics 1 acc 0.5821 ± 0.0250
acc_norm 0.5821 ± 0.0250
hendrycksTest-high_school_mathematics 1 acc 0.3333 ± 0.0287
acc_norm 0.3333 ± 0.0287
hendrycksTest-high_school_microeconomics 1 acc 0.6387 ± 0.0312
acc_norm 0.6387 ± 0.0312
hendrycksTest-high_school_physics 1 acc 0.3510 ± 0.0390
acc_norm 0.3510 ± 0.0390
hendrycksTest-high_school_psychology 1 acc 0.8092 ± 0.0168
acc_norm 0.8092 ± 0.0168
hendrycksTest-high_school_statistics 1 acc 0.5324 ± 0.0340
acc_norm 0.5324 ± 0.0340
hendrycksTest-high_school_us_history 1 acc 0.7794 ± 0.0291
acc_norm 0.7794 ± 0.0291
hendrycksTest-high_school_world_history 1 acc 0.7890 ± 0.0266
acc_norm 0.7890 ± 0.0266
hendrycksTest-human_aging 1 acc 0.6188 ± 0.0326
acc_norm 0.6188 ± 0.0326
hendrycksTest-human_sexuality 1 acc 0.7176 ± 0.0395
acc_norm 0.7176 ± 0.0395
hendrycksTest-international_law 1 acc 0.8099 ± 0.0358
acc_norm 0.8099 ± 0.0358
hendrycksTest-jurisprudence 1 acc 0.7778 ± 0.0402
acc_norm 0.7778 ± 0.0402
hendrycksTest-logical_fallacies 1 acc 0.6933 ± 0.0362
acc_norm 0.6933 ± 0.0362
hendrycksTest-machine_learning 1 acc 0.4286 ± 0.0470
acc_norm 0.4286 ± 0.0470
hendrycksTest-management 1 acc 0.7573 ± 0.0425
acc_norm 0.7573 ± 0.0425
hendrycksTest-marketing 1 acc 0.8675 ± 0.0222
acc_norm 0.8675 ± 0.0222
hendrycksTest-medical_genetics 1 acc 0.6900 ± 0.0465
acc_norm 0.6900 ± 0.0465
hendrycksTest-miscellaneous 1 acc 0.7765 ± 0.0149
acc_norm 0.7765 ± 0.0149
hendrycksTest-moral_disputes 1 acc 0.6416 ± 0.0258
acc_norm 0.6416 ± 0.0258
hendrycksTest-moral_scenarios 1 acc 0.3151 ± 0.0155
acc_norm 0.3151 ± 0.0155
hendrycksTest-nutrition 1 acc 0.6569 ± 0.0272
acc_norm 0.6569 ± 0.0272
hendrycksTest-philosophy 1 acc 0.6688 ± 0.0267
acc_norm 0.6688 ± 0.0267
hendrycksTest-prehistory 1 acc 0.6759 ± 0.0260
acc_norm 0.6759 ± 0.0260
hendrycksTest-professional_accounting 1 acc 0.4433 ± 0.0296
acc_norm 0.4433 ± 0.0296
hendrycksTest-professional_law 1 acc 0.4381 ± 0.0127
acc_norm 0.4381 ± 0.0127
hendrycksTest-professional_medicine 1 acc 0.5919 ± 0.0299
acc_norm 0.5919 ± 0.0299
hendrycksTest-professional_psychology 1 acc 0.5817 ± 0.0200
acc_norm 0.5817 ± 0.0200
hendrycksTest-public_relations 1 acc 0.6091 ± 0.0467
acc_norm 0.6091 ± 0.0467
hendrycksTest-security_studies 1 acc 0.7265 ± 0.0285
acc_norm 0.7265 ± 0.0285
hendrycksTest-sociology 1 acc 0.8259 ± 0.0268
acc_norm 0.8259 ± 0.0268
hendrycksTest-us_foreign_policy 1 acc 0.8400 ± 0.0368
acc_norm 0.8400 ± 0.0368
hendrycksTest-virology 1 acc 0.4940 ± 0.0389
acc_norm 0.4940 ± 0.0389
hendrycksTest-world_religions 1 acc 0.7661 ± 0.0325
acc_norm 0.7661 ± 0.0325

It is hard to know which of these is the MMLU score.

clefourrier commented 6 months ago

@shellhue you need to average all of these scores to get the score we have on the Open LLM Leaderboard.
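
As a minimal sketch of that averaging, assuming the per-subtask output above has been saved to a plain-text file (the file name and the line format matched by the regex are assumptions, not harness features):

import re

# Sketch: unweighted mean of the per-subtask `acc` values from the pasted output.
# "results.txt" is a hypothetical file containing rows like
# "hendrycksTest-abstract_algebra 1 acc 0.3600 ± 0.0482".
pattern = re.compile(r"hendrycksTest-\S+\s+1\s+acc\s+([0-9.]+)")

accs = []
with open("results.txt") as f:
    for line in f:
        match = pattern.search(line)
        if match:
            accs.append(float(match.group(1)))

mean_acc = sum(accs) / len(accs)
print(f"{len(accs)} subtasks, unweighted mean acc = {100 * mean_acc:.2f}")

Run over the 57 rows above, this should land near the 61.4 figure quoted in the next comment.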

clefourrier commented 6 months ago

@haileyschoelkopf Since we pinned a specific version of the harness, we've tried not to edit any of the mechanisms, in order to preserve reproducibility.

shellhue commented 6 months ago

average all of these scores

The average score is 61.4080701, which is very close to the 61.97 reported on the leaderboard.

clefourrier commented 6 months ago

Then I think this solves your issue :)