Open ToheartZhang opened 5 months ago
This seems to be mainly due to your older version of vllm. Try upgrading it and running the evaluation again. Thanks!
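For anyone comparing environments, here is a minimal sketch for printing the installed vllm version before rerunning. It uses only the standard library; the helper name `installed_version` is made up for illustration, and the thread does not say which vllm version is required, so this only reports what is installed.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "vllm" is the package discussed in this thread; torch and transformers
    # are added here only as an assumption about what else is worth reporting.
    for name in ("vllm", "torch", "transformers"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```

Posting this output alongside the scores makes it easier to tell whether a gap comes from the environment.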
Thanks for your help! Here are my updated results with the new vllm version. I think the GPQA dataset is a little unstable.
Model | TheoremQA | GPQA | MMLU STEM | BBH | ARC-C | MATH | GSM8k |
---|---|---|---|---|---|---|---|
MAmmoTH2-7B-Plus (reported) | 29.2 | 36.8 | 65.7 | 63.1 | 83 | 45 | 84.7 |
MAmmoTH2-7B-Plus (reproduced) | 28.88 | 30.81 | 64.58 | 63.05 | 82.68 | 44.42 | 85.06 |
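
To make the gaps easier to read, here is a small sketch that computes reproduced minus reported for each benchmark. The numbers are copied from the table above; the variable names are only illustrative.

```python
# Scores copied from the table above (MAmmoTH2-7B-Plus).
reported = {
    "TheoremQA": 29.2, "GPQA": 36.8, "MMLU STEM": 65.7, "BBH": 63.1,
    "ARC-C": 83.0, "MATH": 45.0, "GSM8k": 84.7,
}
reproduced = {
    "TheoremQA": 28.88, "GPQA": 30.81, "MMLU STEM": 64.58, "BBH": 63.05,
    "ARC-C": 82.68, "MATH": 44.42, "GSM8k": 85.06,
}

# Gap = reproduced - reported; negative means the reproduction is lower.
gaps = {name: reproduced[name] - reported[name] for name in reported}

# Benchmark with the largest absolute difference.
largest = max(gaps, key=lambda name: abs(gaps[name]))

if __name__ == "__main__":
    for name, gap in gaps.items():
        print(f"{name:10s} {gap:+.2f}")
    print("largest gap:", largest)
```

With these numbers, GPQA is the only benchmark that differs by more than about one point (roughly six points lower); the rest are within about 1.1 points of the reported scores.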
Thanks! Would you mind trying our updated checkpoint? It gets better results. Please refer to https://huggingface.co/TIGER-Lab/MAmmoTH2-7B-Plus.
Thanks for your great work! I cloned the `math_eval` directory and ran `run_7B_plus.sh` directly, and found performance gaps on some datasets. My environment is:
Am I missing something? Thanks for your help!