QwenLM / Qwen

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
Apache License 2.0
13.59k stars 1.11k forks source link

请问可以自己修改special token吗? #1114

Closed AllenShow closed 6 months ago

AllenShow commented 7 months ago

您好!感谢您们团队出色的模型与文档! EXTRAS = tuple((f"<|extra_{i}|>" for i in range(205))) <|extra_0|>~<|extra_204|>是用来存放额外special token的是吧,现在想基于千问模型finetune,是否可以将其中几个extra_的token改成自己想用的token?同时训练语料也用自己的token,这样可行吗? 需要修改的代码就是 tokenization_qwen.py 吧?还有别的地方要改吗?

jklj077 commented 7 months ago

In Qwen(1.0), the text representation of special tokens can be freely customized. To make the necessary adjustments, please review the "Special tokens" section within the tokenization documentation found at https://github.com/QwenLM/Qwen/blob/main/tokenization_note.md#special-tokens. Additionally, it's crucial to examine the data preprocessing functions in finetune.py and qwen_generation_utils.py, since special tokens are handled differently from regular tokens.

在Qwen(1.0)中,特殊token的文字表示可以自由定制。若要进行必要的调整,请查阅tokenization文档中“Special tokens”部分(链接:https://github.com/QwenLM/Qwen/blob/main/tokenization_note.md#special-tokens)。此外,在finetune.pyqwen_generation_utils.py中数据预处理函数的实现至关重要,因为特殊token的处理方式与常规token有所不同。