FreedomIntelligence / LLMZoo

⚡LLM Zoo is a project that provides data, models, and evaluation benchmarks for large language models.⚡
Apache License 2.0

The streaming output contains garbled characters. #40

Closed. Nipi64310 closed this issue 1 year ago.

Nipi64310 commented 1 year ago

Hi @wabyking, thanks for sharing this. Because BLOOM's tokenizer encodes some characters as multiple token ids, decoding tokens one at a time during streaming output produces garbled characters. How can such garbled characters be avoided?

https://github.com/FreedomIntelligence/LLMZoo/blob/main/llmzoo/deploy/webapp/inference.py#L113

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('Phoenix-chat-7b')
>>> tokenizer.encode('闫英英')
[1097, 108, 3532, 3532]
>>> tokenizer.decode(1097)
'�'
>>> tokenizer.decode(108)
'�'
>>> tokenizer.decode([1097,108])
'闫'
Nipi64310 commented 1 year ago

Use chr(0xFFFD) (the Unicode replacement character '�') to check whether the decoded text is complete before emitting it.
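
For reference, a minimal sketch of that check, assuming the same Phoenix-chat-7b tokenizer as above; generate_token_ids here is a hypothetical generator that yields token ids one at a time during streaming generation:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Phoenix-chat-7b')

def stream_text(generate_token_ids):
    # Accumulate token ids and only emit text once the decoded string no
    # longer ends with chr(0xFFFD), i.e. the trailing bytes form a
    # complete UTF-8 character.
    ids = []
    emitted = 0  # number of characters already yielded
    for token_id in generate_token_ids():
        ids.append(token_id)
        text = tokenizer.decode(ids, skip_special_tokens=True)
        if text.endswith(chr(0xFFFD)):
            continue  # incomplete multi-byte character; wait for more tokens
        yield text[emitted:]
        emitted = len(text)

Holding back output while it still ends with chr(0xFFFD) delays the stream by at most a few tokens and avoids ever showing the replacement character to the user.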