Closed — l-k-11235, closed 1 year ago
Hello, I can't reproduce the llama7b scores from https://github.com/OpenNMT/OpenNMT-py/blob/master/eval_llm/MMLU/llama7b-onmt.txt
With this config:
```yaml
# transforms
transforms: [sentencepiece]

# Subword
src_subword_model: "/big_llms/llama/tokenizer.model"
tgt_subword_model: "/big_llms/llama/tokenizer.model"

# Model info
model: "../checkpoints/llama_7B.pt"

# Inference
seed: 42
max_length: 1
gpu: 0
batch_type: sents
batch_size: 1
beam_size: 1
report_time: true
```
I get these scores:
```
ACC-abstract_algebra: 0.2600
ACC-anatomy: 0.3704
ACC-astronomy: 0.3487
ACC-business_ethics: 0.4300
ACC-clinical_knowledge: 0.3660
ACC-college_biology: 0.3819
ACC-college_chemistry: 0.3100
ACC-college_computer_science: 0.2900
ACC-college_mathematics: 0.3500
ACC-college_medicine: 0.3237
ACC-college_physics: 0.2255
ACC-computer_security: 0.4600
ACC-conceptual_physics: 0.3745
ACC-econometrics: 0.2632
ACC-electrical_engineering: 0.2345
ACC-elementary_mathematics: 0.2646
ACC-formal_logic: 0.2619
ACC-global_facts: 0.3000
ACC-high_school_biology: 0.3355
ACC-high_school_chemistry: 0.2956
ACC-high_school_computer_science: 0.3300
ACC-high_school_european_history: 0.4727
ACC-high_school_geography: 0.3333
ACC-high_school_government_and_politics: 0.4508
ACC-high_school_macroeconomics: 0.3462
ACC-high_school_mathematics: 0.2556
ACC-high_school_microeconomics: 0.3403
ACC-high_school_physics: 0.2649
ACC-high_school_psychology: 0.4862
ACC-high_school_statistics: 0.3333
ACC-high_school_us_history: 0.3284
ACC-high_school_world_history: 0.4262
ACC-human_aging: 0.3991
ACC-human_sexuality: 0.3435
ACC-international_law: 0.5041
ACC-jurisprudence: 0.4167
ACC-logical_fallacies: 0.4233
ACC-machine_learning: 0.2768
ACC-management: 0.3301
ACC-marketing: 0.4615
ACC-medical_genetics: 0.3700
ACC-miscellaneous: 0.4266
ACC-moral_disputes: 0.4075
ACC-moral_scenarios: 0.2425
ACC-nutrition: 0.4020
ACC-philosophy: 0.4051
ACC-prehistory: 0.3580
ACC-professional_accounting: 0.2695
ACC-professional_law: 0.2992
ACC-professional_medicine: 0.4228
ACC-professional_psychology: 0.3562
ACC-public_relations: 0.4182
ACC-security_studies: 0.3306
ACC-sociology: 0.4726
ACC-us_foreign_policy: 0.4300
ACC-virology: 0.3313
ACC-world_religions: 0.4912
ACC-all: 0.3536
```
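For comparing runs, a quick macro average over the per-task scores can serve as a sanity check against the reference file. This is a minimal sketch assuming the `ACC-<task>: <score>` log format shown above; note that the eval's own `ACC-all` line may be a micro (example-weighted) average, so the two numbers need not match exactly.

```python
import re

def macro_average(report: str) -> float:
    """Average per-task accuracies from an MMLU log, skipping the
    aggregate ACC-all line (assumed 'ACC-<task>: <score>' format)."""
    scores = [
        float(m.group(2))
        for m in re.finditer(r"ACC-(\w+): ([0-9.]+)", report)
        if m.group(1) != "all"
    ]
    return sum(scores) / len(scores)

sample = "ACC-abstract_algebra: 0.2600 ACC-anatomy: 0.3704 ACC-all: 0.3300"
print(f"{macro_average(sample):.4f}")  # → 0.3152 (mean of the two task scores)
```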
Do you have any ideas to explain this discrepancy?
I am also getting a different set of results on my second GPU, which suggests that the FP16 computation is GPU-dependent.