baichuan-inc / Baichuan-7B

A large-scale 7B pretraining language model developed by BaiChuan-Inc.
https://huggingface.co/baichuan-inc/baichuan-7B
Apache License 2.0
5.67k stars, 506 forks

lm-evaluation-harness Chinese task results, compared with WizardLM [Question] #54

Open ishotoli opened 1 year ago

ishotoli commented 1 year ago


Questions

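Thanks to the Baichuan team for their contribution. To assess baichuan-7B's Chinese capabilities, I picked the Chinese tasks in lm-evaluation-harness: xwinograd_zh, xnli_zh, xcopa_zh, xstory_cloze_zh, and mgsm_zh. The first four lean toward reasoning, while mgsm_zh leans toward math. I ran two evaluations, one with num_fewshot set to 0 and one with num_fewshot set to 5. Note that lm-evaluation-harness does not support trust_remote_code for the tokenizer by default, so I had to apply a small hack to get it running; everything else was left unchanged.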
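The post does not show the hack itself. As a rough sketch of what such a change might look like, assuming the harness parses the `model_args` string into a dict and then loads the tokenizer through `transformers.AutoTokenizer.from_pretrained`: the helper names `parse_model_args`, `tokenizer_kwargs`, and `load_tokenizer` below are hypothetical and are not taken from lm-evaluation-harness.

```python
def parse_model_args(arg_string):
    """Parse 'k1=v1,k2=v2' into a dict, mapping 'True'/'False' to booleans."""
    args = {}
    for item in arg_string.split(","):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        if value in ("True", "False"):
            args[key] = (value == "True")
        else:
            args[key] = value
    return args


def tokenizer_kwargs(model_args):
    """Hypothetical helper: pass trust_remote_code on to the tokenizer as well."""
    return {"trust_remote_code": bool(model_args.get("trust_remote_code", False))}


def load_tokenizer(model_args):
    """Load the tokenizer, allowing custom tokenizer code shipped with the model."""
    # Imported lazily so the parsing helpers work without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(
        model_args["pretrained"], **tokenizer_kwargs(model_args)
    )


# The model_args string is the one reported in this issue.
model_args = parse_model_args(
    "pretrained=/models/baichuan-inc_baichuan-7B/,trust_remote_code=True"
)
print(tokenizer_kwargs(model_args))
```

The point of the sketch is only that the flag has to reach the tokenizer call as well as the model call; where exactly that happens inside the harness is an assumption here.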

Results:

hf-causal-experimental (pretrained=/models/baichuan-inc_baichuan-7B/,trust_remote_code=True), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| mgsm_zh | 0 | acc | 0.0360 | ± 0.0118 |
| xcopa_zh | 0 | acc | 0.6700 | ± 0.0210 |
| xstory_cloze_zh | 0 | acc | 0.6320 | ± 0.0124 |
| xwinograd_zh | 0 | acc | 0.7857 | ± 0.0183 |
| xnli_zh | 0 | acc | 0.3818 | ± 0.0069 |

hf-causal-experimental (pretrained=/models/baichuan-inc_baichuan-7B/,trust_remote_code=True), limit: None, provide_description: False, num_fewshot: 5, batch_size: None

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| mgsm_zh | 0 | acc | 0.0960 | ± 0.0187 |
| xcopa_zh | 0 | acc | 0.7240 | ± 0.0200 |
| xstory_cloze_zh | 0 | acc | 0.6565 | ± 0.0122 |
| xwinograd_zh | 0 | acc | 0.8016 | ± 0.0178 |
| xnli_zh | 0 | acc | 0.4341 | ± 0.0070 |

For comparison, WizardLM-7B's Chinese results:

hf-causal-experimental (pretrained=/models/TheBloke_WizardLM-7B-uncensored-GPTQ/,quantized=WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors,gptq_use_triton=True), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| mgsm_zh | 0 | acc | 0.0280 | ± 0.0105 |
| xcopa_zh | 0 | acc | 0.5340 | ± 0.0223 |
| xstory_cloze_zh | 0 | acc | 0.5162 | ± 0.0129 |
| xwinograd_zh | 0 | acc | 0.5417 | ± 0.0222 |
| xnli_zh | 0 | acc | 0.3439 | ± 0.0067 |

hf-causal-experimental (pretrained=/models/TheBloke_WizardLM-7B-uncensored-GPTQ/,quantized=WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors,gptq_use_triton=True), limit: None, provide_description: False, num_fewshot: 5, batch_size: None

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| mgsm_zh | 0 | acc | 0.0360 | ± 0.0118 |
| xcopa_zh | 0 | acc | 0.5420 | ± 0.0223 |
| xstory_cloze_zh | 0 | acc | 0.5242 | ± 0.0129 |
| xwinograd_zh | 0 | acc | 0.6071 | ± 0.0218 |
| xnli_zh | 0 | acc | 0.3599 | ± 0.0068 |

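The comparison shows that Chinese capability is indeed much improved over LLaMA-family derivatives. I hope the Baichuan team keeps getting better!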
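To make the size of the gaps concrete, the reported numbers can be compared per task. The sketch below copies the accuracies and standard errors from the results above and computes the difference plus a rough z-score. It assumes the two standard errors can be combined as if the runs were independent, which is only an approximation because both models were scored on the same test items; the `compare` helper and the dict names are illustrative, not part of lm-evaluation-harness.

```python
import math

# (accuracy, stderr) per task, copied from the results reported above.
BAICHUAN = {
    0: {"mgsm_zh": (0.0360, 0.0118), "xcopa_zh": (0.6700, 0.0210),
        "xstory_cloze_zh": (0.6320, 0.0124), "xwinograd_zh": (0.7857, 0.0183),
        "xnli_zh": (0.3818, 0.0069)},
    5: {"mgsm_zh": (0.0960, 0.0187), "xcopa_zh": (0.7240, 0.0200),
        "xstory_cloze_zh": (0.6565, 0.0122), "xwinograd_zh": (0.8016, 0.0178),
        "xnli_zh": (0.4341, 0.0070)},
}
WIZARDLM = {
    0: {"mgsm_zh": (0.0280, 0.0105), "xcopa_zh": (0.5340, 0.0223),
        "xstory_cloze_zh": (0.5162, 0.0129), "xwinograd_zh": (0.5417, 0.0222),
        "xnli_zh": (0.3439, 0.0067)},
    5: {"mgsm_zh": (0.0360, 0.0118), "xcopa_zh": (0.5420, 0.0223),
        "xstory_cloze_zh": (0.5242, 0.0129), "xwinograd_zh": (0.6071, 0.0218),
        "xnli_zh": (0.3599, 0.0068)},
}


def compare(a, b):
    """Return {task: (difference, z)} for a minus b.

    z divides the difference by sqrt(se_a^2 + se_b^2), treating the two
    standard errors as independent (an approximation, see the note above).
    """
    out = {}
    for task, (acc_a, se_a) in a.items():
        acc_b, se_b = b[task]
        diff = acc_a - acc_b
        combined_se = math.sqrt(se_a ** 2 + se_b ** 2)
        out[task] = (diff, diff / combined_se)
    return out


for shots in (0, 5):
    print(f"num_fewshot={shots}")
    for task, (diff, z) in compare(BAICHUAN[shots], WIZARDLM[shots]).items():
        print(f"  {task:16s} diff={diff:+.4f}  z={z:5.2f}")
```

Under this rough check the four reasoning-oriented tasks show gaps several standard errors wide, while the zero-shot mgsm_zh gap (0.0360 vs 0.0280) is smaller than its combined standard error, so the math result at zero-shot is not a clear difference.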


0xDing commented 1 year ago

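Thanks for sharing. However, WizardLM appears to have been 4-bit quantized while baichuan was not, so the effect of that difference on the results may need to be taken into account.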