tokenizer.decoder raises "'utf-8' codec can't decode bytes in position 1-2: unexpected end of data"

gaokao123 commented 5 months ago

Parsing the tokens after inference raises an exception. The code:

    try:
        print(self.tokenizer.decoder[token.tolist()[0]].decode("utf-8"), end='', flush=True)
    except (UnicodeDecodeError, AttributeError) as e:
        print(token.tolist()[0])
        print('except:', e)

The exception output:

    90476
    except: 'utf-8' codec can't decode bytes in position 1-2: unexpected end of data
    119
    except: 'utf-8' codec can't decode byte 0xbb in position 0: invalid start byte

jklj077 commented 5 months ago

tokenizer.decoder is a mapping from integer token ids to their bytes representation. The tokens are learned by the BPE algorithm at the byte level, which means that not every token is a valid UTF-8 sequence on its own. The token bytes should first be concatenated and then decoded as UTF-8. Decoding may still raise errors if the token sequence is incomplete, which is why the errors argument is needed. All of this is taken care of in tokenizer.decode. Please refer to https://huggingface.co/Qwen/Qwen-72B/blob/main/tokenization_qwen.py#L218 if you must implement it yourself.
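To illustrate the point above, here is a minimal sketch, not the Qwen implementation. It uses a hypothetical toy mapping (`toy_decoder`, with made-up ids and byte splits) in place of the real `tokenizer.decoder`. The helper names `decode_all` and `stream_decode` are also assumptions for illustration. The first helper concatenates the bytes before decoding; the second uses Python's standard incremental UTF-8 decoder, which may suit token-by-token printing.

```python
import codecs

# Hypothetical toy vocabulary: token id -> raw bytes, standing in for
# tokenizer.decoder. The ids and split points are invented for illustration.
# "你好" is E4 BD A0 E5 A5 BD in UTF-8, deliberately split mid-character.
toy_decoder = {
    0: b"\xe4\xbd",  # first two bytes of "你"
    1: b"\xa0\xe5",  # last byte of "你" plus first byte of "好"
    2: b"\xa5\xbd",  # last two bytes of "好"
    3: b"!",
}


def decode_all(token_ids, decoder=toy_decoder):
    """Concatenate the token bytes first, then decode once.

    errors="replace" turns any incomplete trailing sequence into U+FFFD
    instead of raising UnicodeDecodeError.
    """
    raw = b"".join(decoder[t] for t in token_ids)
    return raw.decode("utf-8", errors="replace")


def stream_decode(token_ids, decoder=toy_decoder):
    """Yield text piece by piece, buffering bytes of unfinished characters."""
    inc = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for t in token_ids:
        text = inc.decode(decoder[t])
        if text:  # empty while a character is still incomplete
            yield text
    tail = inc.decode(b"", final=True)  # flush any leftover partial bytes
    if tail:
        yield tail


if __name__ == "__main__":
    print(decode_all([0, 1, 2, 3]))
    for piece in stream_decode([0, 1, 2, 3]):
        print(piece, end="", flush=True)
    print()
```

Decoding a single entry such as `toy_decoder[0].decode("utf-8")` fails in the same way as the code in the report, because that token holds only part of a character.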

gaokao123 commented 5 months ago

@jklj077 OK, I will try. Thank you!

QwenLM / Qwen

tokenizer.decoder raises "'utf-8' codec can't decode bytes in position 1-2: unexpected end of data" #1218