bojone / bytepiece

A purer, higher-compression-rate Tokenizer
Apache License 2.0
442 stars 22 forks

convert_to_sentencepiece error : *** UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 1: unexpected end of data #18

Closed FlyCarrot closed 3 months ago

FlyCarrot commented 3 months ago

When I use convert_to_sentencepiece to convert bytepiece.plus.160k_sp.model from models, I get the following error:

Traceback (most recent call last):
  File "/data/bytepiece/test_demo/model_convert.py", line 3, in <module>
    tokenizer1.convert_to_sentencepiece('bytepiece.plus.160k_sp.model')
  File "/home/${USER}/miniconda3/envs/bytepiece/lib/python3.10/site-packages/bytepiece/bytepiece.py", line 364, in convert_to_sentencepiece
    p = re.sub(' ', '▁', p.decode())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 1: unexpected end of data

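Specifically, it is p.decode() that fails here; print(p) shows b' \xc2'. Is this because the vocabulary contains tokens that cannot be converted to SentencePiece? Any help would be appreciated, thanks!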

After rereading the docs: is it because the tokenizers under models were not necessarily trained with ensure_unicode, so a lossless conversion is not possible?
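
For readers following along, the failure can be reproduced without bytepiece at all. The sketch below uses only the byte string printed in the report; the variable names are illustrative. It shows that 0xc2 is the lead byte of a two-byte UTF-8 sequence, so a piece that ends right after it cannot be decoded on its own.

```python
# The piece printed in the report above: a space followed by a lone 0xC2 byte.
piece = b' \xc2'

try:
    piece.decode()  # same call as in the traceback (UTF-8 by default)
    decoded_ok = True
    error_position = None
    error_reason = None
except UnicodeDecodeError as err:
    decoded_ok = False
    error_position = err.start   # index of the offending byte
    error_reason = err.reason    # human-readable cause

print(decoded_ok, error_position, error_reason)

# 0xC2 opens a two-byte sequence; once a continuation byte follows, decoding works.
complete = b' \xc2\xa0'.decode()  # space + U+00A0 (no-break space)
print(repr(complete))
```

A byte-level vocabulary can legitimately contain such a piece, because a multi-byte character may be split across two tokens.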

bojone commented 3 months ago

Yes. Models trained with ensure_unicode=True are already marked with eu.
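
As a rough way to check a vocabulary before attempting a conversion, something like the sketch below could be used. It assumes the pieces are already available as an iterable of bytes objects; how to obtain them from a bytepiece model is not shown here, and find_non_utf8_pieces is a hypothetical helper, not part of the bytepiece API.

```python
from typing import Iterable, List


def find_non_utf8_pieces(pieces: Iterable[bytes]) -> List[bytes]:
    """Return the pieces that are not valid UTF-8 on their own.

    Hypothetical helper for illustration; not part of bytepiece.
    """
    bad = []
    for piece in pieces:
        try:
            piece.decode('utf-8')
        except UnicodeDecodeError:
            bad.append(piece)
    return bad


# Made-up sample vocabulary: two valid pieces and two truncated sequences.
sample = [b'hello', ' 中'.encode('utf-8'), b' \xc2', b'\xe4\xb8']
bad = find_non_utf8_pieces(sample)
print(bad)
```

If the returned list is non-empty, a decode call like the one in the traceback would be expected to fail on those pieces.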

FlyCarrot commented 3 months ago

Got it, so eu stands for ensure_unicode.