[Help] ChatGLM2-6B gets very low accuracy on CEval when evaluated with lm-evaluation-harness?

Kevin-KWH commented 10 months ago

Is there an existing issue for this?

[X] I have searched the existing issues

Current Behavior

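As the title says: when evaluated with lm-evaluation-harness, ChatGLM2-6B gets very low accuracy on CEval, only a little over 20%, which is far from the officially reported numbers. I don't know what the cause is. I ran the evaluation with https://github.com/EleutherAI/lm-evaluation-harness. Because the answers for the CEval test data have not been released, I used the 1346 val examples: zero-shot acc is 0.2422 and five-shot acc is 0.2835.

To rule out the possibility that the low acc comes from the small size of the CEval val data, I also ran CMMLU in the same way. The CMMLU test data has published answers, 11582 examples in total. The zero-shot and five-shot acc are likewise very low, about the same as the results on the CEval val data.

However, when I use the same https://github.com/EleutherAI/lm-evaluation-harness to run Qwen-14B and Baichuan2-13B, I get acc values of 0.6x and 0.5x on CEval and CMMLU.

So I can't tell where the problem is.

If anyone knows what I did wrong, please let me know. Thanks!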
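A quick sanity check on the numbers above: assuming every CEval question has four answer options (A-D), uniform random guessing gives an expected acc of 0.25. The sketch below uses a normal approximation to the binomial to measure how far each reported acc sits from that baseline. The helper name `chance_z_score` is hypothetical and is not part of lm-evaluation-harness.

```python
import math


def chance_z_score(acc: float, n_items: int, n_choices: int = 4) -> float:
    """z-score of an observed accuracy against uniform random guessing.

    Assumes every item has `n_choices` options and items are independent.
    """
    p0 = 1.0 / n_choices                         # expected acc when guessing
    se = math.sqrt(p0 * (1.0 - p0) / n_items)    # binomial standard error
    return (acc - p0) / se


# acc values reported above for the 1346 CEval val examples
for label, acc in [("zero-shot", 0.2422), ("five-shot", 0.2835)]:
    z = chance_z_score(acc, n_items=1346)
    print(f"{label}: acc={acc:.4f}, z vs. chance={z:+.2f}")
```

A |z| below about 2 means the result cannot be told apart from guessing at this sample size. That would point toward a prompt, tokenization, or model-loading problem rather than toward a genuinely weak model, though this check alone cannot say which.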

Expected Behavior

No response

Steps To Reproduce

clone https://github.com/EleutherAI/lm-evaluation-harness
run python main.py --model hf-causal --model_args pretrained=THUDM/chatglm2-6b --tasks Ceval-valid-*

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :
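The fields above were left empty. A minimal sketch for filling them in, assuming `transformers` and `torch` are installed under those distribution names; the function name `collect_environment` is made up for this example:

```python
import platform
import sys
from importlib import metadata


def collect_environment() -> dict:
    """Gather the values requested by the Environment section."""

    def pkg_version(name: str) -> str:
        # Read the installed version without importing the package itself.
        try:
            return metadata.version(name)
        except metadata.PackageNotFoundError:
            return "not installed"

    info = {
        "OS": platform.platform(),
        "Python": sys.version.split()[0],
        "Transformers": pkg_version("transformers"),
        "PyTorch": pkg_version("torch"),
    }
    try:
        import torch  # imported only to query CUDA availability

        info["CUDA Support"] = str(torch.cuda.is_available())
    except Exception:
        info["CUDA Support"] = "unknown (torch could not be imported)"
    return info


for key, value in collect_environment().items():
    print(f"- {key}: {value}")
```

The printed lines can be pasted directly over the empty list above.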

Anything else?

No response

wangxingjun778 commented 10 months ago

same problem

YaoJiawei329 commented 3 months ago

+1

THUDM / ChatGLM2-6B