Closed. silverriver closed this issue 6 months ago.
This is documented in https://github.com/QwenLM/Qwen/blob/main/tokenization_note.md#regular-tokens.
You can use Qwen2Tokenizer instead if you need tokens as str. (Please be aware that, due to the tokenization mechanism, those tokens are encoded bytes; you need to decode them to get the actual string, as is done in the original GPT-2 tokenizer.)
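For example, a minimal sketch of decoding bytes-typed tokens back into a string (the token values below are illustrative, not taken from the actual vocabulary):

```python
# Sketch: QWenTokenizer-style regular tokens arrive as raw bytes, so a UTF-8
# character that spans multiple tokens must be decoded after joining the
# bytes, not token by token. The tokens here are made up for illustration.
tokens = [b"Hello", b" wor", b"ld", b"\xef\xbc\x81"]  # last token is "！" in UTF-8

# Join first, then decode, so multi-byte characters are handled correctly.
text = b"".join(tokens).decode("utf-8", errors="replace")
print(text)  # Hello world！
```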
Thank you for your response.
Qwen2Tokenizer seems to work in my case. However, I have a few questions regarding this tokenizer.
Thanks in advance.
Unfortunately, it is not possible to use the Qwen2Tokenizer class to load QWenTokenizer files, or vice versa. It is also not recommended to mix Qwen and Qwen2 code. For better compatibility with the transformers ecosystem, we advise you to upgrade to Qwen2.
The vocabulary should be considered the same. The code implementations are different, and some function signatures differ as well.
Qwen1.5 is the beta version of Qwen2, as stated in https://github.com/QwenLM/Qwen1.5#introduction.
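For context, the reason Qwen2Tokenizer can expose str tokens over the same byte-level vocabulary is the GPT-2-style byte-to-unicode mapping, which assigns every byte a printable unicode character. A self-contained sketch of that well-known helper (adapted from the GPT-2 tokenizer; the mapping itself is standard, the usage below is illustrative):

```python
# Sketch of the GPT-2-style byte-to-unicode table used by byte-level BPE
# tokenizers (such as Qwen2Tokenizer) to represent arbitrary bytes as
# printable str tokens instead of raw bytes.
def bytes_to_unicode():
    # Printable bytes map to themselves; the remaining bytes are shifted
    # into the unicode range starting at 256 so every byte gets a visible char.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))
          + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

table = bytes_to_unicode()
token_bytes = b" world"
token_str = "".join(table[b] for b in token_bytes)
print(token_str)  # Ġworld  (the leading space becomes the familiar "Ġ")
```

This is why str-typed tokens from such a tokenizer look like `Ġworld`: the mapping is reversible, so no information is lost relative to the bytes representation.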
Is there an existing issue / discussion for this?
Is there an existing answer for this in the FAQ?
Current Behavior
Regular tokens in Qwen's tokenizer are represented as bytes. The Hugging Face tokenizer implemented in Qwen's HF model returns bytes-typed tokens: https://huggingface.co/Qwen/Qwen-7B/blob/ef3c5c9c57b252f3149c1408daf4d649ec8b6c85/tokenization_qwen.py#L136

However, the Hugging Face tokenizer interface uses str-typed tokens: https://github.com/huggingface/transformers/blob/092f1fdaa4224fdd88c616dc9678e6fcb37bfffd/src/transformers/tokenization_utils_base.py#L1666

Some applications take str-typed tokens as their default, for example: https://github.com/outlines-dev/outlines/blob/6484d8c5439fa0744656bcc05794592635f4533c/outlines/integrations/utils.py#L59

Expected Behavior
Use str-typed tokens in the HF implementation.

Steps To Reproduce
n/a
Environment
Anything else?
n/a