Did you check line by line? Do you have any "0" scores?
Yes, I am sure there were no 0 scores across the 57 subjects, and I get the same score (0.4996) with https://github.com/FranxYao/chain-of-thought-hub/tree/main/MMLU.
What's more, I think both approaches use the same prompt and greedy decoding.
The details are as below:
ACC-abstract_algebra: 0.2900
ACC-anatomy: 0.4963
ACC-astronomy: 0.4868
ACC-business_ethics: 0.5800
ACC-clinical_knowledge: 0.5547
ACC-college_biology: 0.5486
ACC-college_chemistry: 0.2900
ACC-college_computer_science: 0.4700
ACC-college_mathematics: 0.3300
ACC-college_medicine: 0.4220
ACC-college_physics: 0.2059
ACC-computer_security: 0.5600
ACC-conceptual_physics: 0.4213
ACC-econometrics: 0.2719
ACC-electrical_engineering: 0.5172
ACC-elementary_mathematics: 0.3095
ACC-formal_logic: 0.2540
ACC-global_facts: 0.3300
ACC-high_school_biology: 0.5935
ACC-high_school_chemistry: 0.3547
ACC-high_school_computer_science: 0.4900
ACC-high_school_european_history: 0.5455
ACC-high_school_geography: 0.6717
ACC-high_school_government_and_politics: 0.7047
ACC-high_school_macroeconomics: 0.4974
ACC-high_school_mathematics: 0.2593
ACC-high_school_microeconomics: 0.4706
ACC-high_school_physics: 0.3113
ACC-high_school_psychology: 0.6404
ACC-high_school_statistics: 0.4028
ACC-high_school_us_history: 0.6225
ACC-high_school_world_history: 0.5274
ACC-human_aging: 0.6323
ACC-human_sexuality: 0.6183
ACC-international_law: 0.6612
ACC-jurisprudence: 0.6296
ACC-logical_fallacies: 0.5828
ACC-machine_learning: 0.2857
ACC-management: 0.7476
ACC-marketing: 0.7821
ACC-medical_genetics: 0.6300
ACC-miscellaneous: 0.6909
ACC-moral_disputes: 0.5867
ACC-moral_scenarios: 0.2983
ACC-nutrition: 0.5817
ACC-philosophy: 0.6013
ACC-prehistory: 0.5895
ACC-professional_accounting: 0.4255
ACC-professional_law: 0.3781
ACC-professional_medicine: 0.4632
ACC-professional_psychology: 0.4788
ACC-public_relations: 0.6455
ACC-security_studies: 0.5673
ACC-sociology: 0.7114
ACC-us_foreign_policy: 0.7600
ACC-virology: 0.3976
ACC-world_religions: 0.7953
ACC-all: 0.4994
total run time 4356.72
What batch_size did you use?
The same as https://github.com/FranxYao/chain-of-thought-hub/tree/main/MMLU:
batch_size=8
Does batch_size matter?
Yes, it does. Use 1.
Why does batch_size impact the inference result? Could you give me some idea? Thanks!
Our inference function is as below:
```python
from tqdm import tqdm

# batch_split and prepare_input are helpers defined in run_mmlu_open_source.py
def batch_infer(model, tokenizer, prompts):
    batch_size = 8
    answers = []
    for batch_input in tqdm(batch_split(prompts, batch_size)):
        encode_inputs = prepare_input(tokenizer, batch_input)
        # Greedy decoding of a single new token (the answer letter).
        outputs = model.generate(**encode_inputs, max_new_tokens=1,
                                 pad_token_id=tokenizer.pad_token_id)
        answers.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    # Keep only the last character of each decoded sequence, i.e. the generated answer letter.
    answers = [answer[-1] for answer in answers]
    return answers
```
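For reference, here is a minimal, hypothetical sketch (not part of the repo; the model name and prompts are placeholders) of how one could reproduce the effect: generate the answer letter for the same prompts once in a padded batch and once one prompt at a time, then compare.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tiiuae/falcon-40b"  # placeholder: the model under evaluation
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "left"   # decoder-only models should be left-padded
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto",
    trust_remote_code=True,
).eval()

def answer_letters(prompts, batch_size):
    letters = []
    for i in range(0, len(prompts), batch_size):
        enc = tokenizer(prompts[i:i + batch_size], return_tensors="pt",
                        padding=True).to(model.device)
        out = model.generate(**enc, max_new_tokens=1, do_sample=False,
                             pad_token_id=tokenizer.pad_token_id)
        # Decode only the newly generated token for each prompt.
        letters += [tokenizer.decode(out[j, -1:]) for j in range(out.shape[0])]
    return letters

prompts = ["<few-shot MMLU prompt 1>", "<few-shot MMLU prompt 2>"]  # placeholders
print(answer_letters(prompts, batch_size=8))  # shorter prompts get left-padded
print(answer_letters(prompts, batch_size=1))  # no padding; answers may differ
```

With greedy decoding the two runs should agree in principle, but padding changes the tensor shapes going into the FP16 matmuls, so near-tied answer tokens can flip.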
Wow! It's true that the result is greatly improved (0.4994 -> 0.5594) by setting batch_size = 1.
I'm really wondering why batch_size can make such a big difference in LLM evaluation!
ACC-abstract_algebra: 0.3600
ACC-anatomy: 0.5630
ACC-astronomy: 0.5855
ACC-business_ethics: 0.5800
ACC-clinical_knowledge: 0.5962
ACC-college_biology: 0.6528
ACC-college_chemistry: 0.4300
ACC-college_computer_science: 0.5100
ACC-college_mathematics: 0.3800
ACC-college_medicine: 0.4971
ACC-college_physics: 0.2745
ACC-computer_security: 0.6300
ACC-conceptual_physics: 0.4383
ACC-econometrics: 0.3421
ACC-electrical_engineering: 0.5379
ACC-elementary_mathematics: 0.3148
ACC-formal_logic: 0.3175
ACC-global_facts: 0.3300
ACC-high_school_biology: 0.6839
ACC-high_school_chemistry: 0.4384
ACC-high_school_computer_science: 0.6100
ACC-high_school_european_history: 0.6848
ACC-high_school_geography: 0.7475
ACC-high_school_government_and_politics: 0.7668
ACC-high_school_macroeconomics: 0.5795
ACC-high_school_mathematics: 0.3148
ACC-high_school_microeconomics: 0.5714
ACC-high_school_physics: 0.3046
ACC-high_school_psychology: 0.7743
ACC-high_school_statistics: 0.4861
ACC-high_school_us_history: 0.7451
ACC-high_school_world_history: 0.7215
ACC-human_aging: 0.7040
ACC-human_sexuality: 0.7176
ACC-international_law: 0.6694
ACC-jurisprudence: 0.7037
ACC-logical_fallacies: 0.6564
ACC-machine_learning: 0.3214
ACC-management: 0.7573
ACC-marketing: 0.8120
ACC-medical_genetics: 0.6600
ACC-miscellaneous: 0.7586
ACC-moral_disputes: 0.6532
ACC-moral_scenarios: 0.2570
ACC-nutrition: 0.6471
ACC-philosophy: 0.6559
ACC-prehistory: 0.6049
ACC-professional_accounting: 0.4220
ACC-professional_law: 0.4276
ACC-professional_medicine: 0.6250
ACC-professional_psychology: 0.5523
ACC-public_relations: 0.6455
ACC-security_studies: 0.6735
ACC-sociology: 0.7910
ACC-us_foreign_policy: 0.8400
ACC-virology: 0.4819
ACC-world_religions: 0.7836
ACC-all: 0.5594
total run time 13467.72
I think this is a known issue, but I'm unsure whether it's related to this: https://openreview.net/forum?id=9MDjKb9lGi. The reason might be FP16 inference and long token lengths.
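To check that explanation, a rough sketch along the following lines may help: run one forward pass on the same prompt with and without simulated left padding and compare the next-token logits. Here `model`, `tokenizer`, and `prompt` are assumed to be the same objects used in the evaluation, `tokenizer.pad_token_id` is assumed to be set, and the padding length (2048) is arbitrary.

```python
import torch

@torch.no_grad()
def next_token_logits(model, tokenizer, prompt, pad_to=None):
    enc = tokenizer(prompt, return_tensors="pt")
    input_ids, attention_mask = enc["input_ids"], enc["attention_mask"]
    if pad_to is not None and pad_to > input_ids.shape[1]:
        # Simulate the left padding a shorter prompt receives inside a batch.
        pad_len = pad_to - input_ids.shape[1]
        input_ids = torch.cat(
            [torch.full((1, pad_len), tokenizer.pad_token_id, dtype=input_ids.dtype),
             input_ids], dim=1)
        attention_mask = torch.cat(
            [torch.zeros((1, pad_len), dtype=attention_mask.dtype),
             attention_mask], dim=1)
    out = model(input_ids=input_ids.to(model.device),
                attention_mask=attention_mask.to(model.device))
    return out.logits[0, -1].float()  # logits for the next token

plain = next_token_logits(model, tokenizer, prompt)
padded = next_token_logits(model, tokenizer, prompt, pad_to=2048)
print("max |logit delta|:", (plain - padded).abs().max().item())
print("top-1 prediction changed:", plain.argmax().item() != padded.argmax().item())
```

A nonzero delta would reflect how the model handles the masked padding positions plus FP16 rounding; if the top-1 token flips on prompts where two answer letters are nearly tied, that would line up with the accuracy gap reported above.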
Thank you very much!!!
Thanks for this work!
I also used the evaluation script from https://github.com/FranxYao/chain-of-thought-hub/MMLU/run_mmlu_open_source.py.
I got the same result as the original repo (Falcon40b (FP16), Acc-avg = 0.4996), but you got 0.5499, which is close to the Hugging Face result (0.57).
What's the difference between the two approaches? I would appreciate it if you could give me some ideas!