Hi @wabyking, thanks for sharing this. Because Bloom's tokenizer splits some characters across multiple token IDs, decoding one token at a time produces garbled characters in the streaming output. How can such garbled characters be avoided?
https://github.com/FreedomIntelligence/LLMZoo/blob/main/llmzoo/deploy/webapp/inference.py#L113

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('Phoenix-chat-7b')
>>> tokenizer.encode('闫英英')
[1097, 108, 3532, 3532]
>>> tokenizer.decode(1097)
'�'
>>> tokenizer.decode(108)
'�'
>>> tokenizer.decode([1097, 108])
'闫'
Use chr(0xFFFD) (the Unicode replacement character '�') to check: if the decoded text still ends with it, the last character is incomplete, so hold the output back until more tokens arrive and it decodes cleanly.
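For reference, here is a minimal sketch of that check. `stream_decode` is a hypothetical helper, not the exact code in inference.py: it accumulates token IDs, re-decodes the buffer on every step, and only emits new text once the decoded string no longer ends with the replacement character.

```python
# Minimal sketch, assuming a Bloom-based tokenizer like the one in the report above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Phoenix-chat-7b")

def stream_decode(token_ids):
    """Yield text increments, holding back characters that are not fully decoded yet."""
    buffer = []   # token ids received so far
    emitted = 0   # number of characters already yielded
    for tid in token_ids:
        buffer.append(tid)
        text = tokenizer.decode(buffer)
        # A trailing U+FFFD means the last character's bytes are still incomplete,
        # so wait for the next token before emitting anything new.
        if text.endswith(chr(0xFFFD)):
            continue
        yield text[emitted:]
        emitted = len(text)

# The ids from the report decode cleanly once they are grouped:
print("".join(stream_decode([1097, 108, 3532, 3532])))  # -> 闫英英
```

With this buffering, a two-token character such as '闫' is emitted intact instead of as two '�' fragments.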