bojone / bytepiece

A purer tokenizer with a higher compression ratio
Apache License 2.0

Loading fails after converting to sentencepiece #13

Open yzlnew opened 7 months ago

yzlnew commented 7 months ago

After converting to an sp model via the class method convert_to_sentencepiece, loading it raises an error:

import sentencepiece as spm

sp_model = spm.SentencePieceProcessor()
sp_model.Load("sp.model")
libc++abi: terminating due to uncaught exception of type Darts::Details::Exception: /Users/runner/work/sentencepiece/sentencepiece/third_party/darts_clone/darts.h:1143: exception: failed to insert key: zero-length key

Related issue: https://github.com/google/sentencepiece/issues/156

The model contains "\0" pieces. Should they be removed during conversion, and would removing them have any side effects?
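For illustration only: Darts (the double-array trie used by sentencepiece) treats keys as C-style strings, so a piece consisting solely of "\0" is seen as a zero-length key and rejected, which matches the exception above. A minimal sketch of filtering out such pieces before conversion, using a hypothetical piece-to-frequency table rather than bytepiece's actual model format:

```python
# Hypothetical piece table: piece string -> frequency.
# This is NOT the real bytepiece model layout, just an illustration.
pieces = {"\x00": 1, "hello": 100, "world": 90}

# Drop pieces that would become zero-length keys in a C-string trie:
# the empty string, or anything starting with a NUL byte.
safe_pieces = {p: f for p, f in pieces.items()
               if p and not p.startswith("\x00")}

print(sorted(safe_pieces))  # ['hello', 'world']
```

Whether dropping the "\0" piece changes tokenization behavior would depend on how often that piece is actually selected during encoding, which this sketch does not address.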

bojone commented 7 months ago

Could you share the model before conversion, or provide a minimal reproduction script?

yzlnew commented 7 months ago

@bojone Reproduced following the example in the README. The model is here: https://microbin.yzlnew.com/upload/sloth-worm-falcon

from bytepiece import Tokenizer

tokenizer1 = Tokenizer('tokenizer_80k_small_isolated.model')
tokenizer1.convert_to_sentencepiece('sp.model')

import sentencepiece as spm
tokenizer2 = spm.SentencePieceProcessor("sp.model")

bojone commented 7 months ago

@yzlnew It looks like your model is not an ensure_unicode version? Only models trained with ensure_unicode are guaranteed to convert cleanly to sentencepiece (in recent versions, ensure_unicode is enabled by default; you can check this).

yzlnew commented 7 months ago

@bojone That's strange. This model was trained with 0.6.3, and ensure_unicode was enabled.