EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

The score of Qwen/Qwen1.5-7B on MMLU is 1.5 points lower than the score reported on the leaderboard #1525

Closed: shellhue closed this 6 months ago

shellhue commented 6 months ago

I cloned the lm-evaluation-harness repo from main and followed the instructions to install it. Then I evaluated the model Qwen/Qwen1.5-7B on MMLU with the command below. The output MMLU score is 60.43, but the MMLU score reported on the open_llm_leaderboard is 61.97. Is the difference normal, or am I evaluating in a wrong way?

repo commit: b177c82c

The command I used to evaluate:

lm_eval --model hf \
    --model_args pretrained=Qwen/Qwen1.5-7B \
    --tasks mmlu \
    --device cuda:0 \
    --batch_size 8 \
    --num_fewshot 5
baberabb commented 6 months ago

Seems about right. Have you tried with --batch_size 1?

shellhue commented 6 months ago

After setting --batch_size 1, I get 60.48; no big difference.

baberabb commented 6 months ago

Hmm. You could also check on the OpenLLM branch (https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463) as well, just to be sure. The command has changed a bit, but it's in the README, and MMLU can be called with --tasks hendrycksTest*.

haileyschoelkopf commented 6 months ago

@clefourrier how have you all been aggregating the MMLU subtasks' scores for the Open LLM Leaderboard?

We switched to weighting MMLU scores by the number of docs per subtask, as in https://github.com/hendrycks/test/blob/4450500f923c49f1fb1dd3d99108a0bd9717b660/evaluate.py#L82-L99 -- (we hadn't previously been reporting an averaged MMLU score in v0.3.0!)
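
For illustration, here is a minimal sketch of the two aggregation schemes under discussion: an unweighted mean over subtask accuracies versus a mean weighted by the number of documents per subtask, in the spirit of the linked evaluate.py. The accuracies and document counts below are placeholder values, not output from the harness.

# Sketch: unweighted vs. doc-count-weighted aggregation of MMLU subtask accuracies.
# All numbers here are illustrative placeholders.
subtask_acc = {
    "abstract_algebra": 0.3600,
    "moral_scenarios": 0.3151,
    "professional_law": 0.4381,
}
subtask_docs = {  # assumed per-subtask document counts
    "abstract_algebra": 100,
    "moral_scenarios": 895,
    "professional_law": 1534,
}

# Unweighted: every subtask counts equally, regardless of its size.
unweighted = sum(subtask_acc.values()) / len(subtask_acc)

# Weighted: each subtask contributes in proportion to its document count,
# equivalent to pooling all questions and computing a single accuracy.
total_docs = sum(subtask_docs.values())
weighted = sum(subtask_acc[t] * subtask_docs[t] for t in subtask_acc) / total_docs

print(f"unweighted mean acc: {unweighted:.4f}")
print(f"doc-weighted mean acc: {weighted:.4f}")

Because subtask sizes are very uneven, the two schemes generally give slightly different aggregate scores, which is why reporting both can be useful.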

baberabb commented 6 months ago

(average of all the results acc)

from here.

Looks like they do an unweighted average. Might be worth reporting both weighted and unweighted for MMLU (at least for the group as a whole).

shellhue commented 6 months ago

Hmm. You could also check on the OpenLLM branch (https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463) as well, just to be sure. The command has changed a bit, but it's in the README, and MMLU can be called with --tasks hendrycksTest*.

I tried and got:

Task Version Metric Value Stderr
hendrycksTest-abstract_algebra 1 acc 0.3600 ± 0.0482
acc_norm 0.3600 ± 0.0482
hendrycksTest-anatomy 1 acc 0.5185 ± 0.0432
acc_norm 0.5185 ± 0.0432
hendrycksTest-astronomy 1 acc 0.6842 ± 0.0378
acc_norm 0.6842 ± 0.0378
hendrycksTest-business_ethics 1 acc 0.6600 ± 0.0476
acc_norm 0.6600 ± 0.0476
hendrycksTest-clinical_knowledge 1 acc 0.6792 ± 0.0287
acc_norm 0.6792 ± 0.0287
hendrycksTest-college_biology 1 acc 0.6736 ± 0.0392
acc_norm 0.6736 ± 0.0392
hendrycksTest-college_chemistry 1 acc 0.4400 ± 0.0499
acc_norm 0.4400 ± 0.0499
hendrycksTest-college_computer_science 1 acc 0.5800 ± 0.0496
acc_norm 0.5800 ± 0.0496
hendrycksTest-college_mathematics 1 acc 0.3500 ± 0.0479
acc_norm 0.3500 ± 0.0479
hendrycksTest-college_medicine 1 acc 0.6243 ± 0.0369
acc_norm 0.6243 ± 0.0369
hendrycksTest-college_physics 1 acc 0.4020 ± 0.0488
acc_norm 0.4020 ± 0.0488
hendrycksTest-computer_security 1 acc 0.7200 ± 0.0451
acc_norm 0.7200 ± 0.0451
hendrycksTest-conceptual_physics 1 acc 0.5574 ± 0.0325
acc_norm 0.5574 ± 0.0325
hendrycksTest-econometrics 1 acc 0.4561 ± 0.0469
acc_norm 0.4561 ± 0.0469
hendrycksTest-electrical_engineering 1 acc 0.5586 ± 0.0414
acc_norm 0.5586 ± 0.0414
hendrycksTest-elementary_mathematics 1 acc 0.4683 ± 0.0257
acc_norm 0.4683 ± 0.0257
hendrycksTest-formal_logic 1 acc 0.4048 ± 0.0439
acc_norm 0.4048 ± 0.0439
hendrycksTest-global_facts 1 acc 0.3700 ± 0.0485
acc_norm 0.3700 ± 0.0485
hendrycksTest-high_school_biology 1 acc 0.7161 ± 0.0256
acc_norm 0.7161 ± 0.0256
hendrycksTest-high_school_chemistry 1 acc 0.5320 ± 0.0351
acc_norm 0.5320 ± 0.0351
hendrycksTest-high_school_computer_science 1 acc 0.7200 ± 0.0451
acc_norm 0.7200 ± 0.0451
hendrycksTest-high_school_european_history 1 acc 0.7091 ± 0.0355
acc_norm 0.7091 ± 0.0355
hendrycksTest-high_school_geography 1 acc 0.7828 ± 0.0294
acc_norm 0.7828 ± 0.0294
hendrycksTest-high_school_government_and_politics 1 acc 0.8083 ± 0.0284
acc_norm 0.8083 ± 0.0284
hendrycksTest-high_school_macroeconomics 1 acc 0.5821 ± 0.0250
acc_norm 0.5821 ± 0.0250
hendrycksTest-high_school_mathematics 1 acc 0.3333 ± 0.0287
acc_norm 0.3333 ± 0.0287
hendrycksTest-high_school_microeconomics 1 acc 0.6387 ± 0.0312
acc_norm 0.6387 ± 0.0312
hendrycksTest-high_school_physics 1 acc 0.3510 ± 0.0390
acc_norm 0.3510 ± 0.0390
hendrycksTest-high_school_psychology 1 acc 0.8092 ± 0.0168
acc_norm 0.8092 ± 0.0168
hendrycksTest-high_school_statistics 1 acc 0.5324 ± 0.0340
acc_norm 0.5324 ± 0.0340
hendrycksTest-high_school_us_history 1 acc 0.7794 ± 0.0291
acc_norm 0.7794 ± 0.0291
hendrycksTest-high_school_world_history 1 acc 0.7890 ± 0.0266
acc_norm 0.7890 ± 0.0266
hendrycksTest-human_aging 1 acc 0.6188 ± 0.0326
acc_norm 0.6188 ± 0.0326
hendrycksTest-human_sexuality 1 acc 0.7176 ± 0.0395
acc_norm 0.7176 ± 0.0395
hendrycksTest-international_law 1 acc 0.8099 ± 0.0358
acc_norm 0.8099 ± 0.0358
hendrycksTest-jurisprudence 1 acc 0.7778 ± 0.0402
acc_norm 0.7778 ± 0.0402
hendrycksTest-logical_fallacies 1 acc 0.6933 ± 0.0362
acc_norm 0.6933 ± 0.0362
hendrycksTest-machine_learning 1 acc 0.4286 ± 0.0470
acc_norm 0.4286 ± 0.0470
hendrycksTest-management 1 acc 0.7573 ± 0.0425
acc_norm 0.7573 ± 0.0425
hendrycksTest-marketing 1 acc 0.8675 ± 0.0222
acc_norm 0.8675 ± 0.0222
hendrycksTest-medical_genetics 1 acc 0.6900 ± 0.0465
acc_norm 0.6900 ± 0.0465
hendrycksTest-miscellaneous 1 acc 0.7765 ± 0.0149
acc_norm 0.7765 ± 0.0149
hendrycksTest-moral_disputes 1 acc 0.6416 ± 0.0258
acc_norm 0.6416 ± 0.0258
hendrycksTest-moral_scenarios 1 acc 0.3151 ± 0.0155
acc_norm 0.3151 ± 0.0155
hendrycksTest-nutrition 1 acc 0.6569 ± 0.0272
acc_norm 0.6569 ± 0.0272
hendrycksTest-philosophy 1 acc 0.6688 ± 0.0267
acc_norm 0.6688 ± 0.0267
hendrycksTest-prehistory 1 acc 0.6759 ± 0.0260
acc_norm 0.6759 ± 0.0260
hendrycksTest-professional_accounting 1 acc 0.4433 ± 0.0296
acc_norm 0.4433 ± 0.0296
hendrycksTest-professional_law 1 acc 0.4381 ± 0.0127
acc_norm 0.4381 ± 0.0127
hendrycksTest-professional_medicine 1 acc 0.5919 ± 0.0299
acc_norm 0.5919 ± 0.0299
hendrycksTest-professional_psychology 1 acc 0.5817 ± 0.0200
acc_norm 0.5817 ± 0.0200
hendrycksTest-public_relations 1 acc 0.6091 ± 0.0467
acc_norm 0.6091 ± 0.0467
hendrycksTest-security_studies 1 acc 0.7265 ± 0.0285
acc_norm 0.7265 ± 0.0285
hendrycksTest-sociology 1 acc 0.8259 ± 0.0268
acc_norm 0.8259 ± 0.0268
hendrycksTest-us_foreign_policy 1 acc 0.8400 ± 0.0368
acc_norm 0.8400 ± 0.0368
hendrycksTest-virology 1 acc 0.4940 ± 0.0389
acc_norm 0.4940 ± 0.0389
hendrycksTest-world_religions 1 acc 0.7661 ± 0.0325
acc_norm 0.7661 ± 0.0325

It is hard to know which of these is the MMLU score.

clefourrier commented 6 months ago

@shellhue you need to average all of these scores to get the score we have on the Open LLM Leaderboard.
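
As a minimal sketch of that averaging, assuming the per-subtask output above has been saved to a plain-text file (the file name and the line format matched by the regex are assumptions, not harness features):

import re

# Sketch: unweighted mean of the per-subtask `acc` values from the pasted output.
# "results.txt" is a hypothetical file containing rows like
# "hendrycksTest-abstract_algebra 1 acc 0.3600 ± 0.0482".
pattern = re.compile(r"hendrycksTest-\S+\s+1\s+acc\s+([0-9.]+)")

accs = []
with open("results.txt") as f:
    for line in f:
        match = pattern.search(line)
        if match:
            accs.append(float(match.group(1)))

mean_acc = sum(accs) / len(accs)
print(f"{len(accs)} subtasks, unweighted mean acc = {100 * mean_acc:.2f}")

Run over the 57 rows above, this should land near the 61.4 figure quoted in the next comment.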

clefourrier commented 6 months ago

@haileyschoelkopf Since we pinned a specific version of the harness, we've tried not to edit any of the mechanisms, in order to preserve reproducibility.

shellhue commented 6 months ago

average all of these scores

The average score is 61.4080701, which is very close to the 61.97 reported on the leaderboard.

clefourrier commented 6 months ago

Then I think this solves your issue :)