We recommend using the code in LongAlign for evaluation. Specifically, you need to modify Line 68 from

```python
if "internlm" in valid_path or "chatglm" in valid_path or "longalign-6b" in valid_path:
```

to

```python
if "internlm" in valid_path or "glm" in valid_path or "longalign-6b" in valid_path:
```

so that the chat template is applied.
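For context, here is a paraphrased sketch of what that branch controls; the variable names and the fallback path are assumptions, not the verbatim eval.py source. Models whose path matches the condition go through the ChatGLM-style `model.chat()` interface, which applies the chat template; other models get the raw prompt.

```python
# Paraphrased sketch of the logic around Line 68 of LongAlign's
# LongBench_Chat/eval.py -- an illustration under assumptions, not the
# verbatim source.
if "internlm" in valid_path or "glm" in valid_path or "longalign-6b" in valid_path:
    # With "glm", glm-4-9b-chat-1m also matches, so the model's built-in
    # chat interface (and therefore its chat template) is used.
    response, history = model.chat(tokenizer, prompt, history=[])
else:
    # Fallback: the raw prompt goes to generate() without any chat template.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    response = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
```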
Hi, thanks for the suggestion! I can get a score of 7.42, which is quite close to the reported score given the randomness of GPT-4's judgment.
System Info

Who can help?

No response

Information

Reproduction
Thanks for the great work on the long-context model and long-context benchmark.
I find it challenging to reproduce the results of THUDM/glm-4-9b-chat-1m on LongBench-Chat (the results are reported on the HuggingFace repo). I have tried several ways to generate responses:

1. Run the eval.py script (https://github.com/THUDM/LongAlign/blob/9ae0b597737c6658f4350ef7a42d5d01980d142c/LongBench_Chat/eval.py) directly. However, I find no chat template is applied (see the chat-template sketch after this list). Score: 5.46.
2. Deploy the model with vllm and run the LongBench-Chat evaluation with the eval.py script, generating responses via vllm. Score: 7.22.
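Since method 1's problem is the missing chat template, here is a minimal sketch of applying it manually with transformers before generation. The model path is from this issue; the prompt and `max_new_tokens` are illustrative:

```python
# Minimal sketch: apply the GLM-4 chat template manually via transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "THUDM/glm-4-9b-chat-1m"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "..."  # a LongBench-Chat query
# apply_chat_template wraps the query in the model's role tokens -- the step
# that running eval.py directly appears to skip.
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
output = model.generate(**inputs, max_new_tokens=1024)  # max_new_tokens is illustrative
response = tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
```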
Dependency info:
Expected behavior
It makes sense that method 1 did not work, as no chat template was applied. Method 2 comes closer, scoring 7.22 against the 7.82 reported on the HuggingFace repo, but a gap remains.
Can you please share the proper sampling params, or the code snippets, to reproduce the score on LongBench-Chat?
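For reference, method 2 followed roughly the pattern below; the sampling values and context limit are placeholders rather than a known-good configuration, which is exactly what the question above asks about.

```python
# Sketch of the vllm path from method 2. temperature/top_p/max_tokens are
# placeholder values, not the settings that reproduce the reported 7.82.
from vllm import LLM, SamplingParams

llm = LLM(
    model="THUDM/glm-4-9b-chat-1m",
    trust_remote_code=True,
    max_model_len=131072,  # context limit chosen for illustration
)
sampling = SamplingParams(temperature=1.0, top_p=1.0, max_tokens=1024)

tokenizer = llm.get_tokenizer()
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "..."}],  # a LongBench-Chat query
    add_generation_prompt=True,
    tokenize=False,
)
outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)
```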