RWKV / rwkv.cpp

INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model
MIT License

"Unexpected end of data" when decoding partial Unicode characters with World tokenizer #102

Closed: cgisky1980 closed this issue 1 year ago

cgisky1980 commented 1 year ago

Model: RWKV World 3B or 7B, quantized to Q8_0.

Input:

```
Translate the following text into Korean: "Hello"
```

Output:

```
File "/www/wenda-pi/llms/rwkvcpp/rwkv_tokenizer.py", line 94, in decode
    return self.decodeBytes(tokens).decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data
```

After modifying the above file to use `return self.decodeBytes(tokens).decode('utf-8', 'ignore')`, the output is 안하세요., but the correct output should be 안녕하세요: the character 녕 is lost.

With the RWKV World FP16 model, the output is correct.
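
For context, here is a minimal standalone reproduction of the symptom (an illustration only, not code from rwkv.cpp or wenda). One way this can happen is when output is streamed and each token's bytes are decoded separately: if the three UTF-8 bytes of 녕 are split across two tokens, a strict decode raises exactly the error above, and `errors='ignore'` silently drops the character.

```python
text = "안녕하세요"
data = text.encode("utf-8")              # 15 bytes, 3 per Hangul syllable

# Pretend the tokenizer produced these byte chunks for successive tokens,
# with 녕 (0xEB 0x85 0x95) split across the second and third chunk:
chunks = [data[:3], data[3:5], data[5:]]

try:
    chunks[1].decode("utf-8")            # only the first 2 bytes of 녕
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data

# errors='ignore' drops the incomplete bytes instead of raising:
print("".join(c.decode("utf-8", "ignore") for c in chunks))  # 안하세요
```

This reproduces both the UnicodeDecodeError and the missing 녕 reported above.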

saharNooby commented 1 year ago

> With the RWKV World FP16 model, the output is correct.

Do you mean that the same code works correctly when using the FP16 model?

cgisky1980 commented 1 year ago

> > With the RWKV World FP16 model, the output is correct.
>
> Do you mean that the same code works correctly when using the FP16 model?

Yes. I mean that models without quantization don't have this problem.

saharNooby commented 1 year ago

Then I think this issue actually points at 2 separate problems:

  1. the quantized model produces less correct text than the non-quantized model;
  2. a UnicodeDecodeError is thrown.

For 1, there is no real solution. Quantization reduces quality; that is expected, since information is cut from the model to make it smaller.

For 2, it is an actual bug that can be fixed. I'll put it into my backlog, but anyone is welcome to take it.
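
One possible way to address problem 2 on the tokenizer side (a sketch only, assuming text is streamed token by token; not necessarily how rwkv.cpp would fix it): feed the raw bytes through Python's incremental UTF-8 decoder, which buffers a trailing partial character until the rest of its bytes arrive instead of raising or dropping it.

```python
import codecs

def stream_decode(byte_chunks):
    """Yield text from per-token byte chunks, buffering incomplete UTF-8 sequences.

    `byte_chunks` is a stand-in for whatever produces each token's raw bytes
    (a hypothetical name, for illustration only).
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    for chunk in byte_chunks:
        # Returns only fully decoded characters; trailing partial bytes stay
        # buffered inside the decoder until the next chunk arrives.
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush at the end of generation.
    yield decoder.decode(b"", final=True)
```

With the three chunks from the illustration above, `"".join(stream_decode(chunks))` gives 안녕하세요, with 녕 intact.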

konflictue commented 1 year ago

> After modifying the above file to use `return self.decodeBytes(tokens).decode('utf-8', 'ignore')` ... the character 녕 is lost.

@cgisky1980 You can also try the 'replace' error mode here; it has a better chance of producing all characters properly with the RWKV World models.
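
For reference, a small sketch (reusing the `chunks` list from the illustration above) of what 'replace' does differently from 'ignore': undecodable bytes become U+FFFD replacement characters instead of disappearing, so a dropped character is at least visible in the output.

```python
# Reusing `chunks` from the earlier illustration:
print("".join(c.decode("utf-8", "replace") for c in chunks))  # 안��하세요
```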

cgisky1980 commented 1 year ago

> Then I think this issue actually points at 2 separate problems:
>
>   1. the quantized model produces less correct text than the non-quantized model;
>   2. a UnicodeDecodeError is thrown.
>
> For 1, there is no real solution. Quantization reduces quality; that is expected, since information is cut from the model to make it smaller.
>
> For 2, it is an actual bug that can be fixed. I'll put it into my backlog, but anyone is welcome to take it.

Because this issue is 100% reproducible with the 3B and 7B models, I don't think it's a problem of accuracy loss.

P.S. https://github.com/saharNooby/rwkv.cpp/issues/19 is the good first issue. LOL

cgisky1980 commented 1 year ago

> > After modifying the above file to use `return self.decodeBytes(tokens).decode('utf-8', 'ignore')` ... the character 녕 is lost.
>
> @cgisky1980 You can also try the 'replace' error mode here; it has a better chance of producing all characters properly with the RWKV World models.

Yes, it works.