TIGER-AI-Lab / MAmmoTH2

Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024]
https://tiger-ai-lab.github.io/MAmmoTH2/
MIT License

Issue about reproducing results in some datasets #2

Open ToheartZhang opened 5 months ago

ToheartZhang commented 5 months ago

Thanks for your great work! I cloned the math_eval directory and ran run_7B_plus.sh directly, and found performance gaps on some datasets.

| Model | TheoremQA | GPQA | MMLU STEM | BBH | ARC-C | MATH | GSM8k |
|---|---|---|---|---|---|---|---|
| MAmmoTH2-7B-Plus (reported) | 29.2 | 36.8 | 65.7 | 63.1 | 83 | 45 | 84.7 |
| MAmmoTH2-7B-Plus (reproduced) | 26.75 | 31.31 | 64.29 | 63.6 | 83.02 | 44.32 | 83.4 |

My environment is:

vllm                      0.2.6
torch                     2.1.2
transformers              4.40.0

Am I missing something? Thanks for your help!

wenhuchen commented 5 months ago

Please check out https://github.com/TIGER-AI-Lab/MAmmoTH2/blob/main/math_eval/requirements.txt.

wenhuchen commented 5 months ago

This seems to be mainly due to your older version of vllm. Try upgrading it to reproduce the results. Thanks!
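A version mismatch like this can be caught before running the evaluation. The following is a minimal sketch, not part of the repository: it assumes the requirements file pins packages as `name==version`, and the helper names (`parse_pins`, `installed_version`, `find_mismatches`) are hypothetical. The pin `vllm==9.9.9` in the example is a placeholder, not the real pinned version.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Return {package: version} for lines of the form `name==version`."""
    pins = {}
    for raw in requirements_text.splitlines():
        # Drop trailing comments and environment markers, then whitespace.
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" not in line:
            continue  # skip blank lines and unpinned requirements
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def installed_version(name):
    """Look up the installed version of a package, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {package: (wanted, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


# Example with the environment from this issue and a PLACEHOLDER pin for vllm.
environment = {"vllm": "0.2.6", "torch": "2.1.2", "transformers": "4.40.0"}
example_pins = parse_pins("vllm==9.9.9\ntorch==2.1.2  # same as installed\n")
print(find_mismatches(example_pins, lookup=environment.get))
```

To check a real environment, read the contents of math_eval/requirements.txt, pass them to `parse_pins`, and call `find_mismatches` with the default `lookup`.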

ToheartZhang commented 5 months ago

Thanks for your help! Here are my updated results with the new vllm version. Most numbers are now close to the reported ones, but GPQA is still about 6 points lower, so I think the GPQA dataset is a little unstable.

| Model | TheoremQA | GPQA | MMLU STEM | BBH | ARC-C | MATH | GSM8k |
|---|---|---|---|---|---|---|---|
| MAmmoTH2-7B-Plus (reported) | 29.2 | 36.8 | 65.7 | 63.1 | 83 | 45 | 84.7 |
| MAmmoTH2-7B-Plus (reproduced) | 28.88 | 30.81 | 64.58 | 63.05 | 82.68 | 44.42 | 85.06 |
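The per-benchmark gaps in the table above can be computed with a short script. This is only an illustrative sketch: the numbers are copied from the table, and the names `BENCHMARKS`, `score_gaps`, and `largest_drop` are made up for this example.

```python
BENCHMARKS = ["TheoremQA", "GPQA", "MMLU STEM", "BBH", "ARC-C", "MATH", "GSM8k"]

# Scores copied from the table above (after the vllm upgrade).
reported = [29.2, 36.8, 65.7, 63.1, 83.0, 45.0, 84.7]
reproduced = [28.88, 30.81, 64.58, 63.05, 82.68, 44.42, 85.06]


def score_gaps(names, reported_scores, reproduced_scores):
    """Return {benchmark: reproduced - reported}, rounded to two decimals."""
    return {
        name: round(rep - ref, 2)
        for name, ref, rep in zip(names, reported_scores, reproduced_scores)
    }


def largest_drop(gaps):
    """Return the benchmark whose reproduced score falls furthest below the reported one."""
    return min(gaps, key=gaps.get)


gaps = score_gaps(BENCHMARKS, reported, reproduced)
for name, gap in gaps.items():
    print(f"{name:10s} {gap:+.2f}")
print("Largest drop:", largest_drop(gaps))
```

A negative value means the reproduced score is below the reported one. GPQA stands out with a gap of about 6 points, while every other benchmark is within roughly 1 point.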


wenhuchen commented 5 months ago

Thanks! Would you mind trying our updated checkpoint? It gets better results. Please refer to https://huggingface.co/TIGER-Lab/MAmmoTH2-7B-Plus.