baichuan-inc / Baichuan-13B

A 13B large language model developed by Baichuan Intelligent Technology
https://huggingface.co/baichuan-inc/Baichuan-13B-Chat
Apache License 2.0

Vicuna-13B performs unexpectedly poorly on all your evaluations. Did you use the delta weights directly without merging? #10

Closed Ying1123 closed 1 year ago

Ying1123 commented 1 year ago

In your MMLU evaluation, the accuracy of Vicuna is only 24.9%, which is the same as a random guess. This is obviously wrong. Did you directly use our delta weights (https://huggingface.co/lmsys/vicuna-13b-delta-v1.1) without merging them with the base weights?

If you correctly use our latest weights (https://github.com/lm-sys/FastChat#vicuna-weights), you should get an MMLU accuracy about 52.1 (https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard).
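Conceptually, the delta release means the usable weights are recovered by element-wise addition of the delta onto the base model, parameter by parameter. A minimal sketch of that merge step (the `merge_delta` function and the toy dicts are illustrative; in practice you would use FastChat's `apply_delta` tool, and real checkpoints hold named tensors rather than float lists):

```python
# Sketch of a delta-weight merge, assuming the released delta is
# (target - base), so usable weights are base + delta per parameter.
def merge_delta(base, delta):
    # Refuse to merge checkpoints whose parameter names don't line up.
    if base.keys() != delta.keys():
        raise ValueError("base and delta checkpoints do not match")
    # Element-wise addition for each named parameter.
    return {name: [b + d for b, d in zip(base[name], delta[name])]
            for name in base}

# Toy "checkpoints" with two tiny parameters for illustration.
base = {"embed": [1.0, 2.0], "lm_head": [0.5]}
delta = {"embed": [0.25, -0.5], "lm_head": [0.25]}
merged = merge_delta(base, delta)
# merged["embed"] == [1.25, 1.5]; merged["lm_head"] == [0.75]
```

Skipping this step and evaluating the raw delta checkpoint yields effectively random weights, which is consistent with the near-chance 24.9% MMLU score reported.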

xiangrongzeng commented 1 year ago

We found some mismatches when merging vicuna-13b-delta-v1.1 with the base weights. We are re-evaluating vicuna-13b-v1.3 and will update the results later. Thank you for the reminder.

xiangrongzeng commented 1 year ago

We evaluated vicuna-13b-v1.3 on MMLU and the average accuracy is 52.0. Vicuna-13B indeed performs well :). In addition, we provide the per-subject scores below.

abstract_algebra: 0.290
anatomy: 0.511
astronomy: 0.474
business_ethics: 0.540
clinical_knowledge: 0.517
college_biology: 0.562
college_chemistry: 0.380
college_computer_science: 0.480
college_mathematics: 0.310
college_medicine: 0.405
college_physics: 0.275
computer_security: 0.670
conceptual_physics: 0.387
econometrics: 0.281
electrical_engineering: 0.483
elementary_mathematics: 0.275
formal_logic: 0.365
global_facts: 0.290
high_school_biology: 0.577
high_school_chemistry: 0.424
high_school_computer_science: 0.540
high_school_european_history: 0.685
high_school_geography: 0.677
high_school_government_and_politics: 0.767
high_school_macroeconomics: 0.464
high_school_mathematics: 0.256
high_school_microeconomics: 0.483
high_school_physics: 0.272
high_school_psychology: 0.714
high_school_statistics: 0.380
high_school_us_history: 0.730
high_school_world_history: 0.713
human_aging: 0.587
human_sexuality: 0.679
international_law: 0.694
jurisprudence: 0.611
logical_fallacies: 0.663
machine_learning: 0.429
management: 0.709
marketing: 0.825
medical_genetics: 0.580
miscellaneous: 0.724
moral_disputes: 0.587
moral_scenarios: 0.257
nutrition: 0.601
philosophy: 0.588
prehistory: 0.583
professional_accounting: 0.418
professional_law: 0.425
professional_medicine: 0.489
professional_psychology: 0.542
public_relations: 0.600
security_studies: 0.616
sociology: 0.746
us_foreign_policy: 0.740
virology: 0.464
world_religions: 0.789

STEM: 0.404
humanities: 0.495
social sciences: 0.605
other (business, health, misc.): 0.584
Average accuracy: 0.520

Ying1123 commented 1 year ago

Thanks for the update!