Question about how the tokenizer encodes <|im_start|>

amulil commented 1 week ago

I tested with the following code:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")
print("token of <|im_start|>: " + str(tokenizer.encode("<|im_start|>")))
print("token of <|im_end|>: " + str(tokenizer.encode("<|im_end|>")))

The result is odd:

token of <|im_start|>: [1581, 59705, 622, 59593, 5858, 46826]
token of <|im_end|>: [7]

I would expect <|im_start|> to encode to the single token ID 6.

I'm not sure whether this is a tokenizer problem, so I opened PRs on the official Hugging Face repo: https://huggingface.co/01-ai/Yi-1.5-9B-Chat/discussions/12 https://huggingface.co/01-ai/Yi-1.5-9B-Chat/discussions/13

Could you check whether something is wrong here? Thanks.

EricLingRui commented 1 week ago

try: tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat", use_fast=False)

amulil commented 1 week ago

try: tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat", use_fast=False)

Thanks, with that change the problem above goes away.
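For anyone who wants to confirm the fix locally, here is a minimal sketch that checks whether each special token encodes to exactly one ID under both the fast and the slow tokenizer. The helper names encodes_as_single_token and compare_fast_and_slow are hypothetical and not part of transformers; only AutoTokenizer.from_pretrained, encode, and convert_tokens_to_ids are library calls. It assumes the special tokens are present in the vocabulary.

```python
def encodes_as_single_token(tokenizer, token):
    """Return True if `token` encodes to exactly its own vocabulary ID."""
    # add_special_tokens=False keeps BOS/EOS out of the comparison.
    ids = tokenizer.encode(token, add_special_tokens=False)
    return ids == [tokenizer.convert_tokens_to_ids(token)]


def compare_fast_and_slow(model_id, tokens=("<|im_start|>", "<|im_end|>")):
    """Hypothetical helper: run the check with use_fast=True and use_fast=False."""
    # Imported here so the helper above has no hard dependency on transformers.
    from transformers import AutoTokenizer

    report = {}
    for use_fast in (True, False):
        tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=use_fast)
        for token in tokens:
            # Key is (use_fast, token); value is True when the token stays whole.
            report[(use_fast, token)] = encodes_as_single_token(tokenizer, token)
    return report
```

Calling compare_fast_and_slow("01-ai/Yi-1.5-9B-Chat") downloads the tokenizer files, and the slow tokenizer may additionally need the sentencepiece package installed. Any entry that comes back False marks a token that was split into several pieces, as reported above for <|im_start|> with the default tokenizer.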

01-ai / Yi-1.5

Question about how the tokenizer encodes <|im_start|> #33