01-ai / Yi-1.5

Yi-1.5 is an upgraded version of Yi, delivering stronger performance in coding, math, reasoning, and instruction-following capability.
Apache License 2.0

Question about how the tokenizer encodes <|im_start|> #33

Closed amulil closed 1 week ago

amulil commented 1 week ago

I tested with the following code:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")
print("token of <|im_start|>: " + str(tokenizer.encode("<|im_start|>")))
print("token of <|im_end|>: " + str(tokenizer.encode("<|im_end|>")))

The result is strange:

token of <|im_start|>: [1581, 59705, 622, 59593, 5858, 46826]
token of <|im_end|>: [7]

By rights, the output for <|im_start|> should be the single token ID 6.

I'm not sure whether this is a tokenizer problem, so I opened PRs on the official Hugging Face repo: https://huggingface.co/01-ai/Yi-1.5-9B-Chat/discussions/12 https://huggingface.co/01-ai/Yi-1.5-9B-Chat/discussions/13

Could you please check whether something is wrong here? Thanks.

EricLingRui commented 1 week ago

try: tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat", use_fast=False)

amulil commented 1 week ago

try: tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat", use_fast=False)

Thanks, with this change the problem above goes away.
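
For anyone who wants to check this on their own setup, here is a minimal sketch (not from the Yi maintainers) that loads the tokenizer with use_fast=True and use_fast=False and reports the IDs each one produces for the chat markers. The helper names special_token_ids, is_single_token, and compare_fast_and_slow are hypothetical and made up for illustration. The only transformers calls used are AutoTokenizer.from_pretrained with its use_fast argument and encode with add_special_tokens=False.

```python
def special_token_ids(tokenizer, token):
    """Return the IDs the tokenizer produces for `token` on its own."""
    # add_special_tokens=False keeps any automatic BOS/EOS out of the result,
    # so the list reflects only how `token` itself was encoded.
    return tokenizer.encode(token, add_special_tokens=False)


def is_single_token(tokenizer, token):
    """True if `token` is encoded as exactly one ID (not split into pieces)."""
    return len(special_token_ids(tokenizer, token)) == 1


def compare_fast_and_slow(model_id, tokens=("<|im_start|>", "<|im_end|>")):
    """Encode each marker with the fast and the slow tokenizer.

    Hypothetical helper: returns {use_fast: {token: ids}}.
    Needs `transformers` installed and the tokenizer files downloadable or cached.
    """
    from transformers import AutoTokenizer  # imported lazily on purpose

    report = {}
    for use_fast in (True, False):
        tok = AutoTokenizer.from_pretrained(model_id, use_fast=use_fast)
        report[use_fast] = {t: special_token_ids(tok, t) for t in tokens}
    return report
```

Calling compare_fast_and_slow("01-ai/Yi-1.5-9B-Chat") and printing the result should make it easy to see whether the fast tokenizer still splits <|im_start|> into several IDs while the slow one returns a single ID. What it prints depends on the installed transformers version and the tokenizer files in the repo at the time.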