Qwen2-1.5B HumanEval results are 10 points lower than reported

QwenLM / Qwen2.5-Coder

Qwen2.5-Coder is the code version of Qwen2.5, the large language model series developed by Qwen team, Alibaba Cloud.


Qwen2-1.5B HumanEval results are 10 points lower than reported #88

Closed. SefaZeng closed this issue 2 months ago

SefaZeng commented 2 months ago

Load from ground-truth from /root/.cache/evalplus/84f4b93a1270b492e4c54d5212da7a5b.pkl
Reading samples...
humaneval (base tests)
pass@1: 0.201
humaneval+ (base + extra tests)
pass@1: 0.183

Evaluating Qwen2-1.5B directly, the results I get are 10 points lower than those in the technical report, and the results for Qwen1.5 are also very low.
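For context on the pass@1 figures in the log, here is a minimal sketch of the commonly used unbiased pass@k estimator, averaged over problems. The function names `pass_at_k` and `mean_pass_at_k` are hypothetical and are not EvalPlus's API. The example assumes one generated sample per problem, which the log does not state; under that assumption pass@1 is simply the fraction of problems solved, and 33 of HumanEval's 164 problems passing would give the reported 0.201.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated, c: number that passed all tests.
    """
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over a list of (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical data: 164 problems, one sample each, 33 solved.
results = [(1, 1)] * 33 + [(1, 0)] * 131
print(round(mean_pass_at_k(results, 1), 3))
```

By the same arithmetic, 30 of 164 gives about 0.183, so the gap between the base and plus scores would correspond to three problems that pass the base tests but fail the extra ones.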

cyente commented 2 months ago

https://evalplus.github.io/leaderboard.html https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard

Please refer to some third-party benchmarks, similar to the ones mentioned above, to check for any differences.

SefaZeng commented 2 months ago

> https://evalplus.github.io/leaderboard.html https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard
>
> Please refer to some third-party benchmarks, similar to the ones mentioned above, to check for any differences.

Hi, could this repo be used to evaluate the Qwen1.5 models?

cyente commented 2 months ago

We recommend using Qwen2. We have run and tested the evaluation on Qwen2 and can basically confirm that everything is aligned correctly.