THUDM / GLM-4

GLM-4 series: Open Multilingual Multimodal Chat LMs | 开源多语言多模态对话模型
Apache License 2.0

Reproducing the results of THUDM/glm-4-9b-chat-1m on LongBench-Chat #553

Closed cameron-chen closed 1 month ago

cameron-chen commented 2 months ago

System Info / 系統信息

cuda 12.3
python 3.11 

torch==2.3.1
vllm==0.5.3.post1
vllm-flash-attn==2.5.9.post1
transformers==4.44.1

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

Reproduction / 复现过程

Thanks for the great work on the long-context model and long-context benchmark.

I find it challenging to reproduce the results of THUDM/glm-4-9b-chat-1m on LongBench-Chat (the results are reported on the HuggingFace repo).

I have tried several ways to generate responses:

  1. Use the eval.py script (https://github.com/THUDM/LongAlign/blob/9ae0b597737c6658f4350ef7a42d5d01980d142c/LongBench_Chat/eval.py) directly. However, I find that no chat template is applied.
    • Cmd:
      python eval.py --model_path THUDM/glm-4-9b-chat-1m
    • Evaluation score: 5.46
  2. Generate the responses with vLLM and evaluate them with the LongBench-Chat eval.py script (a rough sketch of this pipeline is shown after this list).
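For reference, here is a rough sketch of the method-2 pipeline. The sampling parameters, context-length setting, and data-loading details are my own assumptions, not the settings behind the reported score:

```python
# Rough sketch: generate with vLLM using the GLM-4 chat template, then score the
# responses with the LongBench-Chat eval.py judge. File names, field names, and
# decoding settings below are assumptions, not the official evaluation setup.
import json

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "THUDM/glm-4-9b-chat-1m"

tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
# max_model_len is reduced here only to fit GPU memory; adjust as needed.
llm = LLM(model=MODEL, trust_remote_code=True, max_model_len=131072)

# Hypothetical input file and field names; adapt to the LongBench-Chat data format.
with open("longbench_chat.jsonl") as f:
    examples = [json.loads(line) for line in f]

# Apply the chat template so prompts match the chat format the model expects
# (the missing piece in method 1).
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": ex["prompt"]}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for ex in examples
]

# Assumed decoding settings; this issue asks the maintainers for the real ones.
sampling_params = SamplingParams(temperature=0.0, max_tokens=1024)
outputs = llm.generate(prompts, sampling_params)

# Dump predictions for the GPT-4 judge in LongBench_Chat/eval.py to score.
with open("glm4_longbench_chat_preds.jsonl", "w") as f:
    for ex, out in zip(examples, outputs):
        f.write(json.dumps({**ex, "response": out.outputs[0].text}, ensure_ascii=False) + "\n")
```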

Expected behavior / 期待表现

It makes sense that method 1 did not work, since no chat template was applied. Method 2 comes closer, scoring 7.22 against the 7.82 reported on the HuggingFace repo, but there is still a gap.

Could you please share the proper sampling parameters, or code snippets, to reproduce the reported score on LongBench-Chat?

davidlvxin commented 1 month ago

We recommend using the code in LongAlign for evaluation. Specifically, you need to modify Line 68 from `if "internlm" in valid_path or "chatglm" in valid_path or "longalign-6b" in valid_path:` to `if "internlm" in valid_path or "glm" in valid_path or "longalign-6b" in valid_path:` so that the chat template is used.
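In code, the suggested one-line change to LongBench_Chat/eval.py looks roughly like this (surrounding logic omitted):

```python
# Around line 68 of LongBench_Chat/eval.py: match "glm" rather than "chatglm",
# so that glm-4-9b-chat-1m also takes the branch that applies the chat template.
if "internlm" in valid_path or "glm" in valid_path or "longalign-6b" in valid_path:
    ...  # in this branch the prompt is built with the model's chat template
```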

cameron-chen commented 2 weeks ago

> We recommend using the code in LongAlign for evaluation. Specifically, you need to modify Line 68 from `if "internlm" in valid_path or "chatglm" in valid_path or "longalign-6b" in valid_path:` to `if "internlm" in valid_path or "glm" in valid_path or "longalign-6b" in valid_path:` so that the chat template is used.

Hi, thanks for the suggestion! I now get a score of 7.42, which is quite close to the reported score given the randomness of GPT-4's judgment.