We recommend using the code in LongAlign for evaluation. Specifically, you need to modify Line 68 from

```python
if "internlm" in valid_path or "chatglm" in valid_path or "longalign-6b" in valid_path:
```

to

```python
if "internlm" in valid_path or "glm" in valid_path or "longalign-6b" in valid_path:
```

so that the chat template is applied.
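For context, here is a paraphrased sketch of what that branch controls; the variable names and the fallback path are assumptions, not the verbatim eval.py source. Models whose path matches the condition go through the ChatGLM-style `model.chat()` interface, which applies the chat template; other models get the raw prompt.

```python
# Paraphrased sketch of the logic around Line 68 of LongAlign's
# LongBench_Chat/eval.py -- an illustration under assumptions, not the
# verbatim source.
if "internlm" in valid_path or "glm" in valid_path or "longalign-6b" in valid_path:
    # With "glm", glm-4-9b-chat-1m also matches, so the model's built-in
    # chat interface (and therefore its chat template) is used.
    response, history = model.chat(tokenizer, prompt, history=[])
else:
    # Fallback: the raw prompt goes to generate() without any chat template.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    response = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
```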
Hi, thanks for the suggestion! I can get a score of 7.42, which is quite close to the reported score given the randomness of GPT-4's judgment.
System Info

Who can help?

No response

Information

Reproduction
Thanks for the great work on the long-context model and long-context benchmark.
I find it challenging to reproduce the results of THUDM/glm-4-9b-chat-1m on LongBench-Chat (the results are reported on the HuggingFace repo). I have tried several ways to generate responses:

1. Run the eval.py script (https://github.com/THUDM/LongAlign/blob/9ae0b597737c6658f4350ef7a42d5d01980d142c/LongBench_Chat/eval.py) directly. However, I find no chat template is applied (see the chat-template sketch after this list). Score: 5.46.
2. Deploy the model with vllm and run the LongBench-Chat evaluation with the eval.py script, generating responses via vllm. Score: 7.22.
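Since method 1's problem is the missing chat template, here is a minimal sketch of applying it manually with transformers before generation. The model path is from this issue; the prompt and `max_new_tokens` are illustrative:

```python
# Minimal sketch: apply the GLM-4 chat template manually via transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "THUDM/glm-4-9b-chat-1m"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "..."  # a LongBench-Chat query
# apply_chat_template wraps the query in the model's role tokens -- the step
# that running eval.py directly appears to skip.
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
output = model.generate(**inputs, max_new_tokens=1024)  # max_new_tokens is illustrative
response = tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
```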
Dependency info:
Expected behavior
It makes sense that method 1 did not work, as no chat template was applied. Method 2 comes closer, scoring 7.22 against the 7.82 reported on the HuggingFace repo, but a gap remains.
Can you please share the proper sampling params, or the code snippets, to reproduce the score on LongBench-Chat?
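For reference, method 2 followed roughly the pattern below; the sampling values and context limit are placeholders rather than a known-good configuration, which is exactly what the question above asks about.

```python
# Sketch of the vllm path from method 2. temperature/top_p/max_tokens are
# placeholder values, not the settings that reproduce the reported 7.82.
from vllm import LLM, SamplingParams

llm = LLM(
    model="THUDM/glm-4-9b-chat-1m",
    trust_remote_code=True,
    max_model_len=131072,  # context limit chosen for illustration
)
sampling = SamplingParams(temperature=1.0, top_p=1.0, max_tokens=1024)

tokenizer = llm.get_tokenizer()
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "..."}],  # a LongBench-Chat query
    add_generation_prompt=True,
    tokenize=False,
)
outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)
```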