baichuan-inc / Baichuan-13B

A 13B large language model developed by Baichuan Intelligent Technology
https://huggingface.co/baichuan-inc/Baichuan-13B-Chat
Apache License 2.0
2.98k stars, 236 forks

Baichuan tokenizer segments classical poetry inaccurately #161

Closed: CanvaChen closed this issue 1 year ago

CanvaChen commented 1 year ago

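The poem “白日依山尽,黄河入海流。欲穷千里目,更上一层楼。” is tokenized as follows: ['白', '日', '依', '山', '尽', ',', '黄河', '入', '海', '流', '。', '欲', '穷', '千里', '目', ',', '更', '上一', '层', '楼', '。']

Merging “上一” into a single token is unreasonable here: in “更上一层楼”, “上” (to ascend) and “一” (the start of “一层”, one story) belong to different words.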

CanvaChen commented 1 year ago

The cause is now known.

jcdiv47 commented 1 year ago

"The cause is now known."

Could you share what the cause was?

CanvaChen commented 1 year ago

@jcdiv47 Because the vocabulary contains the entry “上一”, the tokenizer merges those two characters into a single token.
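
To illustrate the explanation above, here is a minimal sketch of how a single vocabulary entry can force this merge. It is a toy greedy longest-match tokenizer, not the Baichuan tokenizer's actual algorithm, and `greedy_tokenize`, `TOY_VOCAB`, and `max_len` are hypothetical names introduced only for this example. The toy vocabulary is assumed to hold just the three multi-character tokens visible in the reported output.

```python
def greedy_tokenize(text, vocab, max_len=2):
    """Toy greedy longest-match tokenizer (illustration only).

    At each position, take the longest piece found in `vocab`;
    fall back to a single character when nothing longer matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Assumption: a toy vocabulary containing only the multi-character
# tokens that appear in the output reported in this issue.
TOY_VOCAB = {"黄河", "千里", "上一"}
LINE = "白日依山尽,黄河入海流。欲穷千里目,更上一层楼。"

with_entry = greedy_tokenize(LINE, TOY_VOCAB)
without_entry = greedy_tokenize(LINE, TOY_VOCAB - {"上一"})

print(with_entry)     # "上一" comes out as one token
print(without_entry)  # "上" and "一" come out as separate tokens
```

With “上一” in the toy vocabulary, the sketch reproduces the segmentation reported above. Once that entry is removed, “上” and “一” are emitted separately. This matches the stated cause: the merge comes from the vocabulary entry, not from anything specific to the poem.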