QwenLM / Qwen2

Qwen2 is the large language model series developed by the Qwen team at Alibaba Cloud.

[BUG] Does the Qwen tokenizer not support loading via sp_model? #316

Closed hwb96 closed 3 months ago

hwb96 commented 3 months ago

Current Behavior

I want to merge a local vocabulary into Qwen's vocabulary, but I found that the Qwen tokenizer, whether the fast one or the slow one (use_fast=False), i.e. tokenization_qwen2.py and tokenization_qwen2_fast.py, does not expose an sp_model attribute. Loading fails with:

1. AttributeError: 'Qwen2Tokenizer' object has no attribute 'sp_model'
2. AttributeError: 'Qwen2TokenizerFast' object has no attribute 'sp_model'

Code:

import json
import os
from transformers import LlamaTokenizer, AutoTokenizer
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model
import sentencepiece as spm

qwen_tokenizer_dir = '/project/qwen/model/Qwen1.5-7B-Chat'
qwen_tokenizer = AutoTokenizer.from_pretrained(qwen_tokenizer_dir, use_fast=False)

# Read the tokenizer's sentencepiece model, as the usual vocab-merging scripts do.
qwen_spm = sp_pb2_model.ModelProto()
qwen_spm.ParseFromString(qwen_tokenizer.sp_model.serialized_model_proto())  # AttributeError raised here

Running it raises: AttributeError: 'Qwen2Tokenizer' object has no attribute 'sp_model'

Changing it to

qwen_spm.ParseFromString(qwen_tokenizer.tokenizer.sp_model.serialized_model_proto())

still raises the same error: AttributeError: 'Qwen2Tokenizer' object has no attribute 'sp_model'

Environment

- OS: Ubuntu 20.04
- Python: 3.10.14
- Transformers: 4.39.2
- PyTorch: 2.1.2
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 12.2

Anything else?

No response

hwb96 commented 3 months ago

https://github.com/QwenLM/Qwen/blob/main/tokenization_note_zh.md

The note says: the mapping from bytes-type regular tokens to ids can be obtained via tokenizer.get_vocab(). Adding regular tokens to the tokenizer is neither supported nor recommended.

My intention is to add a few hundred domain-specific Chinese terms to the tokenizer. Does this note mean that expanding the vocabulary is not supported?
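
For reference, a minimal sketch of what the note describes, assuming the same Qwen1.5-7B-Chat path as in the reproduction above (any Qwen1.5 checkpoint should behave the same):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('/project/qwen/model/Qwen1.5-7B-Chat')

# There is no sp_model for a byte-level BPE tokenizer; the token-to-id mapping
# lives in vocab.json / merges.txt and is exposed through get_vocab().
vocab = tokenizer.get_vocab()
print(len(vocab))               # vocabulary size
print(list(vocab.items())[:5])  # a few (byte-encoded token, id) pairs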

hwb96 commented 3 months ago

[screenshot] I have another question: why does tokenizing some text with Qwen1.5 display what looks like mojibake? For example, 迟到 becomes ['è¿ŁåĪ°']?

jklj077 commented 3 months ago
  1. Please do not mix Qwen1 and Qwen1.5 code, due to their inherent incompatibilities. Note that the sections of the tokenization note that pertain to Qwen1 code are also outdated for Qwen1.5.
  2. Except for the recently introduced CodeQwen, the tokenizer in Qwen models is not built upon sentencepiece; rather, it employs byte-level BPE similar to GPT models, hence loading a sentencepiece model is not applicable.
  3. Qwen1.5 adheres to the transformers framework's practices and follows the implementation of GPT2Tokenizer: tokens of the bytes type are encoded to str by a byte encoder, which is what you see after the tokenize call (see the sketch after this list). This is solely an artefact of the transformers implementation.
  4. Vocabulary expansion can take place at two stages: pretokenization and BPE tokenization. The former is easy to implement, and transformers supports it via tokenizer.add_tokens(); the added tokens have higher priority than the BPE tokenization. The latter requires continual learning of the BPE merges, the idea of which is illustrated in the tokenization note. The tokenizers library can support training the merges.
  5. Given its ease of use, leveraging tokenizer.add_tokens() for vocabulary expansion seems the more practical option for you.
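
A minimal sketch of point 3, assuming the Qwen1.5-7B-Chat checkpoint from the issue: the strings returned by tokenize() are byte-encoded surface forms, and convert_tokens_to_string() (or decode()) reverses the byte encoder:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('/project/qwen/model/Qwen1.5-7B-Chat')

tokens = tokenizer.tokenize("迟到")
print(tokens)                                       # byte-encoded surface forms, e.g. ['è¿ŁåĪ°'] as in the screenshot
print(tokenizer.convert_tokens_to_string(tokens))   # '迟到' -- the byte encoder is reversed
print(tokenizer.decode(tokenizer.encode("迟到")))    # '迟到'
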
hwb96 commented 3 months ago

Thank you for your response.

I mostly do engineering work and am not very familiar with many of the underlying details. Can I understand it this way: although SentencePiece does support BPE, the Qwen team chose to implement the BPE process themselves rather than rely on the implementation provided by the SentencePiece library, so loading a SentencePiece model is not applicable. Is this somewhat similar to OpenAI's own tokenizer, tiktoken? If I want to learn more details, looking into how GPT2Tokenizer is built would be very helpful for me.

Thank you for the suggestion. I will try using tokenizer.add_tokens() to expand the vocabulary and learn about the differences between pretokenization and BPE tokenization.
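
A minimal sketch of the add_tokens() route, with a hypothetical word list standing in for the real domain vocabulary; if the model will be fine-tuned afterwards, the embedding matrix also has to be resized:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = '/project/qwen/model/Qwen1.5-7B-Chat'  # path taken from the issue; adjust as needed
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)

# Placeholder domain terms; replace with the real several-hundred-entry list.
new_words = ["增值税", "社保缴费基数"]

# add_tokens() skips words already in the vocabulary and returns the number actually added.
num_added = tokenizer.add_tokens(new_words)
print(f"added {num_added} tokens, tokenizer size is now {len(tokenizer)}")

# New tokens need new (trainable) embedding rows before fine-tuning.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))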

jklj077 commented 3 months ago

The core BPE algorithm is similar, but the implementation details differ considerably between GPT-style tokenizers such as tiktoken (BPE at the byte level) and sentencepiece (BPE at the char level with byte fallback). This is mentioned in the tokenization note.

sentencepiece operates on Unicode code points (chars), not on UTF-8-encoded bytes. For example, "你好" is two chars but 6 bytes (b"\xe4\xbd\xa0\xe5\xa5\xbd"). If "你好" is a token, sentencepiece needs one merge ("你", "好"), while BPE at the byte level needs 5 merges: (b"\xe4", b"\xbd"), (b"\xe4\xbd", b"\xa0"), (b"\xe5", b"\xa5"), (b"\xe5\xa5", b"\xbd"), (b"\xe4\xbd\xa0", b"\xe5\xa5\xbd").
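
A quick check of the char/byte counts above, in plain Python:

s = "你好"
print(len(s))                  # 2 characters (code points)
print(s.encode("utf-8"))       # b'\xe4\xbd\xa0\xe5\xa5\xbd'
print(len(s.encode("utf-8")))  # 6 bytes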

The thing is that there are nearly 150K assigned Unicode code points, but only 256 possible byte values. To achieve full coverage, it is unrealistic for sentencepiece to add every code point as a token to the vocabulary, so it adopts the byte-fallback trick: if a code point is not a token in the vocabulary, sentencepiece with byte fallback tokenizes it as bytes. For example, suppose "佰" (b"\xe4\xbd\xb0") is not in the vocabulary above: sentencepiece produces the sequence ("<0xE4>", "<0xBD>", "<0xB0>"), while BPE at the byte level produces the sequence (b"\xe4\xbd", b"\xb0").
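
The byte-fallback surface form quoted above can be reproduced by formatting the UTF-8 bytes directly (this is only string formatting, not an actual sentencepiece run):

# Each UTF-8 byte of an out-of-vocabulary code point becomes a <0xNN> piece.
pieces = [f"<0x{b:02X}>" for b in "佰".encode("utf-8")]
print(pieces)  # ['<0xE4>', '<0xBD>', '<0xB0>']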

They are both BPE, but they are different BPE.