THUDM / GLM-4

GLM-4 series: Open Multilingual Multimodal Chat LMs | 开源多语言多模态对话模型
Apache License 2.0

Is there any difference between the chatglm3 and glm4 tokenizers? chatglm3 works with outlines, but glm4 raises an error #362

Closed Mewral closed 1 week ago

Mewral commented 1 month ago

System Info / 系統信息

transformers==4.41.2 outlines==0.0.44 python==3.10.13

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

Reproduction / 复现过程

```python
import outlines
from transformers import AutoTokenizer

# Passed through to from_pretrained for both the model and the tokenizer
args = {"trust_remote_code": True}

glm4_tokenizer = AutoTokenizer.from_pretrained("glm-4-9b-chat", trust_remote_code=True)
glm3_tokenizer = AutoTokenizer.from_pretrained("chatglm3-6b", trust_remote_code=True)

model = outlines.models.transformers("glm-4-9b-chat", device="cuda:0", model_kwargs=args, tokenizer_kwargs=args)

prompt = "你是谁"  # "Who are you?"

# Constrained generation: force the answer to be one of the two choices
generator = outlines.generate.choice(model, ["汤姆", "杰瑞"])  # "Tom", "Jerry"
answer = generator(prompt)
print(answer)
```

Expected behavior / 期待表现

With chatglm3 this produces output normally, but with glm4 it raises an error.

zRzRzRzRzRzRzR commented 1 month ago

GLM-4 uses BPE tokenization. Is the error you are seeing TypeError: cannot use a string pattern on a bytes-like object?
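That TypeError is standard `re` behavior: a compiled string pattern cannot be applied to `bytes`. If outlines matches a string regex against vocabulary entries that a byte-level BPE tokenizer exposes as `bytes` rather than `str` (an assumption about the failing code path, not confirmed in this thread), the same error falls out of a two-line stdlib reproduction:

```python
import re

# A string pattern matches str tokens, but raises on bytes tokens --
# the same TypeError reported for the GLM-4 tokenizer above.
pattern = re.compile("汤姆|杰瑞")

print(pattern.fullmatch("汤姆"))  # str token: matches

try:
    pattern.fullmatch("汤姆".encode("utf-8"))  # bytes token, as in a byte-level vocab
except TypeError as e:
    print(e)  # cannot use a string pattern on a bytes-like object
```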

Mewral commented 1 month ago

@zRzRzRzRzRzRzR Yes, it is. Isn't glm3 also BPE? Also, how can I tell from a tokenizer object which kind of tokenization model it uses? Thanks.

zRzRzRzRzRzRzR commented 1 week ago

Yes, but the text is split differently. There is a similar issue on Hugging Face here: https://huggingface.co/THUDM/glm-4-9b-chat/discussions/69#66d29c175ae47374c28a17a2
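On the earlier question of how to see which tokenization model a tokenizer uses: fast `transformers` tokenizers expose the Rust backend via `backend_tokenizer`, whose `.model` is the subword model object (BPE, Unigram, WordPiece). Custom tokenizers loaded with `trust_remote_code`, like chatglm3-6b's, typically have no such backend, so a minimal sketch needs a fallback (the helper name here is illustrative, not a transformers API):

```python
def tokenizer_model_name(tok):
    """Return the subword model class name (e.g. 'BPE') for a fast
    tokenizer; fall back to the tokenizer's own class name for
    slow/custom tokenizers that have no Rust backend."""
    # tokenizers.Tokenizer on fast tokenizers, absent on slow ones
    backend = getattr(tok, "backend_tokenizer", None)
    if backend is not None:
        return type(backend.model).__name__
    return type(tok).__name__
```

For example, `tokenizer_model_name(glm4_tokenizer)`. When the fallback path is taken you only learn the wrapper class name; for a custom tokenizer the reliable route is to read the `tokenization_*.py` shipped in the model repo.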