Closed gaokao123 closed 5 months ago
`tokenizer.decoder` is a mapping from token int ids to their bytes representation. The tokens are learned by the BPE algorithm at the byte level, which means not every token is a valid Unicode codepoint on its own. The token bytes should first be concatenated and then decoded with UTF-8; this may still raise errors if the token sequence is incomplete, which is why the `errors` argument is needed. All of this is taken care of in `tokenizer.decode`. Please refer to https://huggingface.co/Qwen/Qwen-72B/blob/main/tokenization_qwen.py#L218 if you must implement it yourself.
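To illustrate the point above, here is a minimal sketch (not the Qwen implementation) using a hypothetical `decoder` dict that stands in for `tokenizer.decoder`: a multi-byte UTF-8 character can be split across byte-level BPE tokens, so decoding each token in isolation fails, while decoding the concatenated bytes succeeds.

```python
# Hypothetical decoder mapping: token id -> raw bytes.
# The 3-byte UTF-8 encoding of "你" is split across tokens 1 and 2.
decoder = {
    1: "你".encode("utf-8")[:2],  # first 2 bytes of a 3-byte character
    2: "你".encode("utf-8")[2:],  # remaining byte
    3: "好".encode("utf-8"),      # a complete character
}

def decode_ids(ids, decoder):
    # Concatenate the byte pieces first, then decode once;
    # errors="replace" guards against a truncated final character.
    data = b"".join(decoder[i] for i in ids)
    return data.decode("utf-8", errors="replace")

print(decode_ids([1, 2, 3], decoder))  # -> 你好

# Decoding token 1 alone raises UnicodeDecodeError
# ("unexpected end of data"), just like the error reported in this issue:
try:
    decoder[1].decode("utf-8")
except UnicodeDecodeError as e:
    print("per-token decode failed:", e)
```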
@jklj077 OK, I will try. Thank you!
Parsing the tokens after inference throws an exception. Code:

```python
try:
    print(self.tokenizer.decoder[token.tolist()[0]].decode("utf-8"), end='', flush=True)
except (UnicodeDecodeError, AttributeError) as e:
    print(token.tolist()[0])
    print('except:', e)
```

Exception output:

```
90476
except: 'utf-8' codec can't decode bytes in position 1-2: unexpected end of data
119
except: 'utf-8' codec can't decode byte 0xbb in position 0: invalid start byte
```
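For streaming output, one way to avoid this (a sketch, not the Qwen implementation; `decoder` below is a hypothetical token-id-to-bytes mapping like `tokenizer.decoder`) is the standard library's incremental UTF-8 decoder, which buffers incomplete byte sequences until the rest arrives instead of decoding each token independently:

```python
import codecs

def stream_decode(token_ids, decoder):
    # Incremental decoder holds back trailing incomplete UTF-8 sequences.
    dec = codecs.getincrementaldecoder("utf-8")(errors="replace")
    out = []
    for tid in token_ids:
        # Feed the raw bytes of each token; only complete characters are emitted.
        out.append(dec.decode(decoder[tid]))
    out.append(dec.decode(b"", final=True))  # flush any remainder
    return "".join(out)

# The 3-byte character "你" split across two tokens decodes cleanly:
decoder = {1: "你".encode("utf-8")[:2], 2: "你".encode("utf-8")[2:]}
print(stream_decode([1, 2], decoder))  # -> 你
```

In a real streaming loop you would print each piece as it is emitted; the key design point is that decoding state lives across tokens rather than being reset per token.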