OpenNMT / OpenNMT-py

Open Source Neural Machine Translation and (Large) Language Models in PyTorch
https://opennmt.net/
MIT License

I can't reproduce the llama7b scores in the MMLU benchmark #2423

Closed: l-k-11235 closed this issue 1 year ago

l-k-11235 commented 1 year ago

Hello, I can't reproduce the llama7b scores reported in https://github.com/OpenNMT/OpenNMT-py/blob/master/eval_llm/MMLU/llama7b-onmt.txt

With this config:

# transforms
transforms: [sentencepiece]

# Subword 
src_subword_model: "/big_llms/llama/tokenizer.model"
tgt_subword_model: "/big_llms/llama/tokenizer.model"

# Model info
model: "../checkpoints/llama_7B.pt"

# Inference
seed: 42
max_length: 1
gpu: 0
batch_type: sents
batch_size: 1
beam_size: 1
report_time: true
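
Since beam_size: 1 and max_length: 1 reduce decoding to a single greedy step, the whole evaluation hinges on one argmax over the next-token logits after the few-shot prompt. A minimal sketch of that step (the logits tensor and answer token ids below are illustrative stand-ins, not the OpenNMT-py API):

import torch

# Sketch of what beam_size: 1 + max_length: 1 amount to: one greedy argmax
# over the next-token logits after the few-shot MMLU prompt.

def greedy_answer(logits_last: torch.Tensor, answer_ids: dict) -> str:
    # Restrict the argmax to the four answer letters (one common scoring
    # variant); with an unrestricted argmax, the decoded token would instead
    # be compared directly to the gold letter.
    return max(answer_ids, key=lambda letter: logits_last[answer_ids[letter]].item())

torch.manual_seed(42)
logits = torch.randn(32000)                             # stand-in for real model logits
answer_ids = {"A": 319, "B": 350, "C": 315, "D": 360}   # hypothetical token ids
print(greedy_answer(logits, answer_ids))                # prints one of "A".."D"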

I get these scores:

ACC-abstract_algebra: 0.2600
ACC-anatomy: 0.3704
ACC-astronomy: 0.3487
ACC-business_ethics: 0.4300
ACC-clinical_knowledge: 0.3660
ACC-college_biology: 0.3819
ACC-college_chemistry: 0.3100
ACC-college_computer_science: 0.2900
ACC-college_mathematics: 0.3500
ACC-college_medicine: 0.3237
ACC-college_physics: 0.2255
ACC-computer_security: 0.4600
ACC-conceptual_physics: 0.3745
ACC-econometrics: 0.2632
ACC-electrical_engineering: 0.2345
ACC-elementary_mathematics: 0.2646
ACC-formal_logic: 0.2619
ACC-global_facts: 0.3000
ACC-high_school_biology: 0.3355
ACC-high_school_chemistry: 0.2956
ACC-high_school_computer_science: 0.3300
ACC-high_school_european_history: 0.4727
ACC-high_school_geography: 0.3333
ACC-high_school_government_and_politics: 0.4508
ACC-high_school_macroeconomics: 0.3462
ACC-high_school_mathematics: 0.2556
ACC-high_school_microeconomics: 0.3403
ACC-high_school_physics: 0.2649
ACC-high_school_psychology: 0.4862
ACC-high_school_statistics: 0.3333
ACC-high_school_us_history: 0.3284
ACC-high_school_world_history: 0.4262
ACC-human_aging: 0.3991
ACC-human_sexuality: 0.3435
ACC-international_law: 0.5041
ACC-jurisprudence: 0.4167
ACC-logical_fallacies: 0.4233
ACC-machine_learning: 0.2768
ACC-management: 0.3301
ACC-marketing: 0.4615
ACC-medical_genetics: 0.3700
ACC-miscellaneous: 0.4266
ACC-moral_disputes: 0.4075
ACC-moral_scenarios: 0.2425
ACC-nutrition: 0.4020
ACC-philosophy: 0.4051
ACC-prehistory: 0.3580
ACC-professional_accounting: 0.2695
ACC-professional_law: 0.2992
ACC-professional_medicine: 0.4228
ACC-professional_psychology: 0.3562
ACC-public_relations: 0.4182
ACC-security_studies: 0.3306
ACC-sociology: 0.4726
ACC-us_foreign_policy: 0.4300
ACC-virology: 0.3313
ACC-world_religions: 0.4912
ACC-all: 0.3536
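
As an aside when comparing numbers: the per-category accuracies pin down the category sizes (e.g. 0.3704 ≈ 50/135 for anatomy), and the headline ACC-all depends on whether categories are macro-averaged or all questions are pooled (micro). A quick sketch of the difference, with the counts below taken as assumptions and only three categories shown:

# Sketch: macro vs. micro aggregation of per-category MMLU accuracies.
# Category sizes are assumptions (chosen so acc * n is an integer);
# the eval script defines the actual weighting.
acc = {"abstract_algebra": 0.2600, "anatomy": 0.3704, "astronomy": 0.3487}
n   = {"abstract_algebra": 100,    "anatomy": 135,    "astronomy": 152}

macro = sum(acc.values()) / len(acc)                       # mean of category accuracies
micro = sum(acc[c] * n[c] for c in acc) / sum(n.values())  # pooled over all questions
print(f"macro={macro:.4f}  micro={micro:.4f}")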

Do you have any idea what could explain this discrepancy?

vince62s commented 1 year ago

I am getting a different set of results on my second GPU, which means the FP16 quantization is GPU-dependent.
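
For anyone who wants to verify this on their own hardware, a minimal sketch that runs the same FP16 matrix product on two GPUs and compares the outputs; any nonzero difference shows the FP16 kernels are not bit-identical across devices, which can flip borderline argmax decisions and therefore single-token MMLU answers (the device ids are assumptions for a two-GPU machine):

import torch

# Sketch: is the same FP16 computation bit-identical across two GPUs?
# Assumes at least two CUDA devices; adjust the device ids as needed.
torch.manual_seed(42)
a = torch.randn(1024, 1024, dtype=torch.float16)
b = torch.randn(1024, 1024, dtype=torch.float16)

results = [(a.to(dev) @ b.to(dev)).float().cpu() for dev in ("cuda:0", "cuda:1")]
max_diff = (results[0] - results[1]).abs().max().item()
print(f"max abs difference across devices: {max_diff}")
# A nonzero max_diff means the FP16 matmul kernels differ between the GPUs,
# enough to shift borderline logits and hence greedy single-token choices.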