THUDM / ChatGLM2-6B

ChatGLM2-6B: An Open Bilingual Chat LLM

[BUG/Help] <Why does the same character map to multiple token_ids? (BPE segmentation ruled out)> #598

Open ito-integration opened 1 year ago

ito-integration commented 1 year ago

Is there an existing issue for this?

Current Behavior

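I found that the model's tokenizer maps the same token to at least two token_ids: for example, B → 30949 and 347, C → 30942 and 319, D → 30952 and 367. These are all single letters, so subword segmentation should not be a factor. Won't this cause confusion when converting between tokens and token_ids (for example with tokenizer.decode)?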
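
One possible explanation, which I have not verified against the actual vocabulary: SentencePiece-style tokenizers store a word-initial piece (prefixed with the boundary marker `▁`, U+2581) and a word-internal piece as separate vocabulary entries, and the two can look identical once decoded. The sketch below uses a hypothetical toy table to show how such pairs could be detected. The IDs are the ones reported above, but which ID carries the marker is an assumption made purely for illustration.

```python
from collections import defaultdict

# SentencePiece marks a piece that starts a new word with this symbol (U+2581).
WORD_BOUNDARY = "\u2581"

# Hypothetical id -> piece table. The IDs come from the report above; which ID
# carries the boundary marker is an ASSUMPTION, not checked against the model.
toy_vocab = {
    347: WORD_BOUNDARY + "B", 30949: "B",
    319: WORD_BOUNDARY + "C", 30942: "C",
    367: WORD_BOUNDARY + "D", 30952: "D",
}


def group_ids_by_visible_text(id_to_piece):
    """Group token IDs whose pieces print the same once the marker is stripped."""
    groups = defaultdict(list)
    for token_id, piece in sorted(id_to_piece.items()):
        groups[piece.replace(WORD_BOUNDARY, "")].append(token_id)
    return dict(groups)


groups = group_ids_by_visible_text(toy_vocab)
for text, ids in groups.items():
    # Show the visible text, the IDs that share it, and the raw pieces.
    print(text, ids, [toy_vocab[i] for i in ids])
```

To check the real vocabulary, the raw pieces could be inspected with `tokenizer.convert_ids_to_tokens([30949, 347])`, assuming the tokenizer was loaded through `transformers` with `AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)`. If the two pieces differ only by the `▁` prefix, the mapping would not be ambiguous: each ID would still decode to exactly one piece.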

Expected Behavior

No response

Steps To Reproduce

Environment

Anything else?

No response