I saw that ChatGLM2 is listed on the C-Eval leaderboard with an average score of 71.
However, the zero-shot C-Eval scores reported in the README top out at 61, for ChatGLM-12B.
So I'm not sure whether the 71 comes from ChatGLM-12B improving from 61 to 71 with few-shot prompting, from a different model, or from prompt engineering.
Could you share the details?