In the chat_template of tokenizer_config.json, <sop> may be missing its angle brackets (<>)

System Info

CUDA Version: 12.2
transformers Version: 4.44.2
Python: 3.9.19
Operating system: Linux g3001 5.4.0-144-generic

Who can help?

No response

Information

[X] The official example scripts
[ ] My own modified scripts

Reproduction

I found a potential problem with the <sop> token in the LongWriter-glm4-9b/tokenizer_config.json file on HuggingFace.

In the "chat_template" setting, the sop token is not wrapped in angle brackets (<>). The relevant part is:

"chat_template": "{% for message in messages %}{% if loop.first %}[gMASK]sop<|{{ message['role'] }}|>\n {{ message['content'] }}{% else %}<|{{ message['role'] }}|>\n {{ message['content'] }}{% endif %}{% endfor %}{% if add_generation_prompt %}<|assistant|>{% endif %}",

The prompt generated at runtime looks like this, and its sop has no angle brackets either:

INFO 09-09 17:10:35 logger.py:36] Received request chat-1aff40ba3f3b4c2392be6f89dede9675: prompt: '[gMASK]sop<|system|>\n You are a helpful assistant.<|user|>\n Who won the world series in 2020?<|assistant|>'

However, in the self.special_tokens setting in the code, the sop token does include angle brackets:

self.special_tokens = ["<|endoftext|>", "[MASK]", "[gMASK]", "[sMASK]", "<sop>", "<eop>", "<|system|>", "<|user|>", "<|assistant|>", "<|observation|>", "<|begin_of_image|>", "<|end_of_image|>", "<|begin_of_video|>", "<|end_of_video|>"]
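The mismatch between the template and the special-token list can be checked mechanically. The following is a minimal sketch using only the Python standard library; the helper names literal_text and bare_special_tokens are hypothetical and not part of transformers or the model repository. It strips the Jinja tags from the template quoted above and reports any special token of the form <name> whose name appears in the remaining literal text without its brackets.

```python
import re

# Template and token list copied from this issue.
CHAT_TEMPLATE = (
    "{% for message in messages %}{% if loop.first %}"
    "[gMASK]sop<|{{ message['role'] }}|>\n {{ message['content'] }}"
    "{% else %}<|{{ message['role'] }}|>\n {{ message['content'] }}"
    "{% endif %}{% endfor %}"
    "{% if add_generation_prompt %}<|assistant|>{% endif %}"
)
SPECIAL_TOKENS = [
    "<|endoftext|>", "[MASK]", "[gMASK]", "[sMASK]", "<sop>", "<eop>",
    "<|system|>", "<|user|>", "<|assistant|>", "<|observation|>",
    "<|begin_of_image|>", "<|end_of_image|>",
    "<|begin_of_video|>", "<|end_of_video|>",
]


def literal_text(template):
    """Hypothetical helper: drop Jinja statements and expressions, keep literals."""
    return re.sub(r"\{%.*?%\}|\{\{.*?\}\}", "", template, flags=re.S)


def bare_special_tokens(template, special_tokens):
    """Hypothetical helper: list <name> tokens that occur only as a bare name."""
    text = literal_text(template)
    suspicious = []
    for token in special_tokens:
        match = re.fullmatch(r"<(\w+)>", token)
        if match is None or token in text:
            continue  # not a <name> token, or already written correctly
        name = re.escape(match.group(1))
        # The bare name must not be part of a longer word or a bracketed token.
        if re.search(rf"(?<![<\w]){name}(?![>\w])", text):
            suspicious.append(token)
    return suspicious


print(bare_special_tokens(CHAT_TEMPLATE, SPECIAL_TOKENS))
```

This only flags a textual inconsistency; it does not say which form the model was trained with.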

In addition, I found that the "chat_template" of the LongCite-glm4-9b model does wrap <sop> in angle brackets:

{% for message in messages %}{% if loop.first %}[gMASK]<sop>

I therefore suspect that <sop> in the "chat_template" should be wrapped in angle brackets, and that they were omitted by mistake.
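If the maintainers confirm that the brackets were dropped by mistake, a local workaround could look like the sketch below. It assumes the config has been loaded into a dict (here a trimmed stand-in containing only the key quoted in this issue); add_sop_brackets is a hypothetical helper, not an existing API. Whether the model was actually fine-tuned on the bare sop string is not known from this issue, so this should not be applied blindly.

```python
import json

# Trimmed stand-in for tokenizer_config.json; only the key from this issue is kept.
config = {
    "chat_template": (
        "{% for message in messages %}{% if loop.first %}"
        "[gMASK]sop<|{{ message['role'] }}|>\n {{ message['content'] }}"
        "{% else %}<|{{ message['role'] }}|>\n {{ message['content'] }}"
        "{% endif %}{% endfor %}"
        "{% if add_generation_prompt %}<|assistant|>{% endif %}"
    )
}


def add_sop_brackets(cfg):
    """Hypothetical helper: return a copy of cfg with the bare sop wrapped in <>."""
    patched = dict(cfg)
    # Replace only the first occurrence, right after the [gMASK] prefix.
    patched["chat_template"] = cfg["chat_template"].replace(
        "[gMASK]sop", "[gMASK]<sop>", 1
    )
    return patched


patched_config = add_sop_brackets(config)
# Serialize the way it would be written back to the JSON file.
print(json.dumps(patched_config, ensure_ascii=False, indent=2))
```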

Expected behavior

Should the "chat_template" in LongWriter-glm4-9b/tokenizer_config.json be changed to use the <sop> form? Thanks!

THUDM / LongWriter