Closed hwb96 closed 3 months ago
https://github.com/QwenLM/Qwen/blob/main/tokenization_note_zh.md
I saw the note there saying: the mapping from normal tokens of the bytes type to ids can be obtained via tokenizer.get_vocab(), and that adding normal tokens to the tokenizer is not yet supported nor recommended.
My intent is to add a few hundred industry-specific Chinese terms to the tokenizer. Does this note mean vocabulary expansion is unsupported?
I also have another question: why does tokenizing a text with Qwen1.5 display mojibake? For example, 迟到 becomes ['è¿ŁåĪ°']?
- Please do not mix the use of Qwen1 and Qwen1.5 code due to their inherent incompatibilities. Note that any coding sections pertaining to the tokenization_note for Qwen1 are also outdated for Qwen1.5.
- Except for the recently introduced CodeQwen, the tokenizer in Qwen models is not built upon `sentencepiece`; rather, it employs traditional BPE at the byte level, similar to GPT models, hence loading a `sentencepiece` model is not applicable.
- Qwen1.5 adheres to the `transformers` framework's practices and follows the implementation of `GPT2Tokenizer`: tokens of the `bytes` type are encoded to `str` using a byte encoder, which is what you see after the `tokenize` call. This is solely an artifact of the `transformers` implementation.
- Vocabulary expansion can take place at two stages: pretokenization and BPE tokenization. The former is easy to implement, and `transformers` supports it via `tokenizer.add_tokens()`; the added tokens take higher priority than the BPE tokenization. The latter requires continual learning of the BPE merges, the idea of which is illustrated in the tokenization_note; the `tokenizers` library can support the training of the merges.
- It seems more practical for you to leverage `tokenizer.add_tokens()` for vocabulary expansion given its ease of use.
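That the mojibake like `è¿ŁåĪ°` is purely the byte encoder at work can be checked without loading any model. Below is a minimal standalone sketch of the GPT-2 style `bytes_to_unicode` mapping (the same scheme `GPT2Tokenizer` uses internally; this copy is for illustration only):

```python
def bytes_to_unicode():
    # GPT-2 style byte encoder: map each of the 256 byte values to a
    # printable Unicode character, so BPE can operate on "strings of bytes"
    # without whitespace or control characters getting in the way.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes are shifted into the range starting at 256.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte_encoder = bytes_to_unicode()
visible = "".join(byte_encoder[b] for b in "迟到".encode("utf-8"))
print(visible)  # è¿ŁåĪ°
```

Each of the six UTF-8 bytes of 迟到 is mapped to one visible character, which is exactly the string the question reported.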
Thank you for your response. Since I usually work more on the engineering side, I'm not very familiar with many of the underlying details. Can I understand it this way: although SentencePiece does support BPE, the Qwen team chose to implement the BPE process on their own rather than relying on the implementation provided by the SentencePiece library, so loading a SentencePiece model is not applicable. Is this somewhat similar to OpenAI's own tokenizer, tiktoken? If I want to learn more details, looking into the construction process of GPT2Tokenizer would be very helpful for me.
Thank you for your suggestion. I will try using tokenizer.add_tokens() to expand the vocabulary and learn about the differences between pretokenization and BPE tokenization.
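As a rough illustration of why added tokens take priority over BPE: pretokenization splits the input on the added tokens first, and only the remaining spans reach the BPE step. A minimal sketch, where the character-level fallback is just a stand-in for a real BPE tokenizer and `机器学习` is a hypothetical added token:

```python
import re

def tokenize_with_added(text, added_tokens, fallback_tokenize):
    # Added tokens are matched first (pretokenization stage); only the
    # remaining spans are passed to the underlying tokenizer.
    pattern = "|".join(re.escape(t)
                       for t in sorted(added_tokens, key=len, reverse=True))
    tokens = []
    for piece in re.split(f"({pattern})", text):
        if piece in added_tokens:
            tokens.append(piece)       # added token: emitted whole
        elif piece:
            tokens.extend(fallback_tokenize(piece))  # everything else: BPE
    return tokens

# `list` splits into single characters, standing in for BPE here.
print(tokenize_with_added("我要去机器学习课", {"机器学习"}, list))
# ['我', '要', '去', '机器学习', '课']
```

This is why `tokenizer.add_tokens()` works without retraining any merges: the added tokens never enter the BPE machinery at all.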
The core algorithm of BPE is similar, but the implementation details are quite different between GPT/tiktoken (BPE at the byte level) and sentencepiece (BPE at the char level with byte fallback). This is mentioned in the tokenization_note.
`sentencepiece` operates on Unicode code points or chars, not on UTF-8 encoded bytes. For example, "你好" is two chars but 6 bytes (`b"\xe4\xbd\xa0\xe5\xa5\xbd"`). If "你好" is a token, `sentencepiece` needs one merge, `("你", "好")`, while BPE at the byte level needs 5 merges: `(b"\xe4", b"\xbd")`, `(b"\xe4\xbd", b"\xa0")`, `(b"\xe5", b"\xa5")`, `(b"\xe5\xa5", b"\xbd")`, `(b"\xe4\xbd\xa0", b"\xe5\xa5\xbd")`.
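Applying those 5 merges in order can be sketched with a toy merge loop (for illustration only, not the actual tokenizer code):

```python
merges = [(b"\xe4", b"\xbd"), (b"\xe4\xbd", b"\xa0"),
          (b"\xe5", b"\xa5"), (b"\xe5\xa5", b"\xbd"),
          (b"\xe4\xbd\xa0", b"\xe5\xa5\xbd")]

def apply_merges(seq, merges):
    # Apply each learned merge rule in order, fusing every adjacent
    # (left, right) pair into a single token, as byte-level BPE does.
    for left, right in merges:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq

# Start from one token per UTF-8 byte of "你好".
tokens = apply_merges([bytes([b]) for b in "你好".encode("utf-8")], merges)
print(tokens)  # [b'\xe4\xbd\xa0\xe5\xa5\xbd'] — a single token after 5 merges
```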
The thing is that there are nearly 150K Unicode code points but only 256 possible bytes. To achieve full coverage of the vocabulary, it is unrealistic for `sentencepiece` to add all code points as tokens, so it adopts the byte-fallback trick: if a code point is not a token in the vocabulary, `sentencepiece` with byte fallback tokenizes it as bytes. For example, suppose "佰" (`b"\xe4\xbd\xb0"`) is not in the previous vocabulary; `sentencepiece` produces the sequence `("<0xE4>", "<0xBD>", "<0xB0>")`, while BPE at the byte level produces the sequence `(b"\xe4\xbd", b"\xb0")`.
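A toy sketch of the byte-fallback behavior (the vocabulary here is made up, and real `sentencepiece` works on a trained model rather than a plain set):

```python
def byte_fallback(ch, vocab):
    # If the character is in the vocabulary, emit it as one token;
    # otherwise fall back to one <0xNN> token per UTF-8 byte,
    # mimicking sentencepiece's byte-fallback notation.
    if ch in vocab:
        return [ch]
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]

print(byte_fallback("佰", {"你", "好"}))  # ['<0xE4>', '<0xBD>', '<0xB0>']
print(byte_fallback("你", {"你", "好"}))  # ['你']
```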
They are both BPE, but they are different BPE.
Current Behavior
I was preparing to merge a local vocabulary into Qwen's vocabulary, but found that the Qwen tokenizer, whether fast or slow (use_fast=False), i.e. tokenization_qwen2.py and tokenization_qwen2_fast.py, does not support `sp_model`, and loading raises: 1. AttributeError: 'Qwen2Tokenizer' object has no attribute 'sp_model' 2. AttributeError: 'Qwen2TokenizerFast' object has no attribute 'sp_model'
Code:
Running raises: AttributeError: 'Qwen2Tokenizer' object has no attribute 'sp_model'
After changing it, it still raises: AttributeError: 'Qwen2Tokenizer' object has no attribute 'sp_model'
Environment
Anything else?
No response