allenai / open-instruct


mmlu run_eval.py: difference in evaluation results between run_eval.py (average_acc: 27.4) and the expected results (llama2-hf: 41.8) #82

Closed · Stycoo closed 8 months ago

Stycoo commented 8 months ago

There is a significant difference between the evaluation results produced by the run_eval.py script and the commonly reported numbers (llama2-hf: 41.8). The results I obtained are:

```json
{
  "average_acc": 0.27474718701039735,
  "subcat_acc": {
    "math": 0.21898496240601503,
    "health": 0.275,
    "physics": 0.25,
    "business": 0.33638443935926776,
    "biology": 0.2753303964757709,
    "chemistry": 0.18151815181518152,
    "computer science": 0.308252427184466,
    "economics": 0.24258760107816713,
    "engineering": 0.2482758620689655,
    "philosophy": 0.26192842942345923,
    "other": 0.2944206008583691,
    "history": 0.310752688172043,
    "geography": 0.26262626262626265,
    "politics": 0.2978395061728395,
    "psychology": 0.29818496110630943,
    "culture": 0.3253012048192771,
    "law": 0.2762336925694838
  },
  "cat_acc": {
    "STEM": 0.243870112657389,
    "humanities": 0.2769394261424017,
    "social sciences": 0.2853428664283393,
    "other (business, health, misc.)": 0.2902529302899445
  }
}
```
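For context, MMLU is a four-choice benchmark, so uniform random guessing yields about 25% accuracy, and the reported 27.5% is only slightly above that baseline. A quick sanity check on the numbers above (a minimal sketch in plain Python; the category values are copied verbatim from the output, and the 0.25 baseline is simply the 1-in-4 chance level):

```python
# Sanity check: compare the reported MMLU accuracies against the
# 4-choice random-guessing baseline (1/4 = 0.25).
# Category values are copied verbatim from the run_eval.py output above.
cat_acc = {
    "STEM": 0.243870112657389,
    "humanities": 0.2769394261424017,
    "social sciences": 0.2853428664283393,
    "other (business, health, misc.)": 0.2902529302899445,
}

RANDOM_BASELINE = 0.25  # expected accuracy from uniform guessing over 4 options

# Unweighted mean over categories; note this need not exactly match
# "average_acc", which is presumably averaged over individual questions.
mean_cat = sum(cat_acc.values()) / len(cat_acc)

print(f"mean over categories: {mean_cat:.4f}")   # ~0.2741
print(f"random baseline:      {RANDOM_BASELINE:.4f}")
print("reported average_acc: 0.2747  (expected for llama2-hf: ~0.418)")
```

Every category sits within a few points of chance, which may suggest a systematic issue in the evaluation setup (e.g., prompt formatting or answer extraction) rather than ordinary run-to-run variance.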