Closed: jcao-ai closed this 6 months ago
This is expected. It is how (e.g.) the GPT2Tokenizer in transformers handles this. Basically, not all token ids can be individually decoded as UTF-8, so `convert_ids_to_tokens` uses latin-1 decoding.
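A minimal, library-free sketch of why per-token UTF-8 decoding can fail (the byte piece below is a hypothetical token boundary, not taken from any real tokenizer's vocabulary):

```python
# The first two bytes of the three-byte UTF-8 encoding of '请'
# are not valid UTF-8 on their own.
piece = '请'.encode('utf-8')[:2]   # b'\xe8\xaf'

try:
    piece.decode('utf-8')
except UnicodeDecodeError:
    print('not valid UTF-8 on its own')

# latin-1 maps every byte 0x00-0xFF to a character, so it never fails:
print(piece.decode('latin-1'))    # 'è¯'
```

Because latin-1 is a total, reversible mapping over single bytes, it gives every token a printable (if mojibake-looking) string form.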
Illustrative example:
```python
In [13]: tt.encoding.decode_single_token_bytes(15225)
Out[13]: b'\xe8\xaf\xb7'

In [14]: tt.encoding.decode_single_token_bytes(15225).decode('latin-1')
Out[14]: 'è¯·'

In [15]: tt.encoding.decode_single_token_bytes(15225).decode('utf-8')
Out[15]: '请'
```
In general, it cannot be assumed that `''.join(t.convert_ids_to_tokens([...]))` is the same as `t.decode([...])`.
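A short sketch of the round trip (the two byte pieces are a hypothetical multi-token split of '请', not real vocabulary entries): joining the latin-1 token strings yields mojibake, and re-encoding as latin-1 before a single UTF-8 decode recovers the text.

```python
# Hypothetical token byte pieces that split '请' mid-UTF-8-sequence:
pieces = [b'\xe8\xaf', b'\xb7']

# Per-token "text", as latin-1 gives every byte a character:
tokens = [p.decode('latin-1') for p in pieces]
joined = ''.join(tokens)

# The naive join is mojibake, not the original text:
assert joined == 'è¯·'

# Re-encoding as latin-1 recovers the raw bytes; one UTF-8 decode
# over the whole byte string then yields the intended text:
assert joined.encode('latin-1').decode('utf-8') == '请'
```

This is why decoding has to happen over the concatenated bytes of the whole id sequence, not token by token.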
I suppose it should output token text like `请`?