Issue about reproducing results in some datasets

TIGER-AI-Lab / MAmmoTH2

Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024]

https://tiger-ai-lab.github.io/MAmmoTH2/

MIT License

Open ToheartZhang opened 5 months ago

ToheartZhang commented 5 months ago

Thanks for your great work! I cloned the math_eval directory and ran run_7B_plus.sh directly, and found performance gaps on some datasets.

Model	TheoremQA	GPQA	MMLU STEM	BBH	ARC-C	MATH	GSM8k
MAmmoTH2-7B-Plus (reported)	29.2	36.8	65.7	63.1	83	45	84.7
MAmmoTH2-7B-Plus (reproduced)	26.75	31.31	64.29	63.6	83.02	44.32	83.4

My environment is:

vllm                      0.2.6
torch                     2.1.2
transformers              4.40.0

Am I missing something? Thanks for your help!

wenhuchen commented 5 months ago

This seems to be mainly due to your older version of vllm. Try upgrading it and rerunning to reproduce the results. Thanks!
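Before rerunning, it can help to confirm which package versions the evaluation environment actually picks up. The following is a minimal sketch using only the Python standard library; the helper name `installed_versions` is hypothetical and not part of the MAmmoTH2 repository, and the package list simply mirrors the environment quoted above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the active environment.
            found[name] = None
    return found


if __name__ == "__main__":
    # Packages listed in the environment report earlier in this thread.
    for name, ver in installed_versions(["vllm", "torch", "transformers"]).items():
        print(f"{name:<14}{ver if ver is not None else 'not installed'}")
```

Running this in the same interpreter that launches run_7B_plus.sh shows whether the upgrade took effect there, rather than in a different environment.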

ToheartZhang commented 5 months ago

Thanks for your help! Here are my updated results with the new vllm version. I think the GPQA dataset is a little unstable.

Model	TheoremQA	GPQA	MMLU STEM	BBH	ARC-C	MATH	GSM8k
MAmmoTH2-7B-Plus (reported)	29.2	36.8	65.7	63.1	83	45	84.7
MAmmoTH2-7B-Plus (reproduced)	28.88	30.81	64.58	63.05	82.68	44.42	85.06
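To make the remaining gaps easier to read, the two reproduced runs can be compared against the reported scores benchmark by benchmark. This is a small sketch that only restates the numbers posted in this thread; the function names `score_gaps` and `largest_gap` are illustrative and do not come from the math_eval code.

```python
BENCHMARKS = ["TheoremQA", "GPQA", "MMLU STEM", "BBH", "ARC-C", "MATH", "GSM8k"]

# Scores copied from the tables in this thread.
REPORTED = [29.2, 36.8, 65.7, 63.1, 83, 45, 84.7]
REPRODUCED_OLD_VLLM = [26.75, 31.31, 64.29, 63.6, 83.02, 44.32, 83.4]
REPRODUCED_NEW_VLLM = [28.88, 30.81, 64.58, 63.05, 82.68, 44.42, 85.06]


def score_gaps(reported, reproduced):
    """Return reproduced minus reported for each benchmark, rounded to 2 places."""
    return {
        name: round(rep - ref, 2)
        for name, ref, rep in zip(BENCHMARKS, reported, reproduced)
    }


def largest_gap(gaps):
    """Return the benchmark with the largest absolute difference."""
    return max(gaps, key=lambda name: abs(gaps[name]))


if __name__ == "__main__":
    runs = [("old vllm", REPRODUCED_OLD_VLLM), ("new vllm", REPRODUCED_NEW_VLLM)]
    for label, run in runs:
        gaps = score_gaps(REPORTED, run)
        print(label, gaps, "largest gap:", largest_gap(gaps))
```

With the new vllm version, every benchmark except GPQA is within about 1.1 points of the reported score, while GPQA remains roughly 6 points lower in both runs.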

wenhuchen commented 5 months ago

Thanks! Would you mind trying our updated checkpoint? It gets better results. Please refer to https://huggingface.co/TIGER-Lab/MAmmoTH2-7B-Plus.