databricks / dbrx

Code examples and resources for DBRX, a large language model developed by Databricks
https://www.databricks.com/

`convert_ids_to_tokens` not working as expected. #21

Closed jcao-ai closed 6 months ago

jcao-ai commented 6 months ago
```python
from transformers import AutoTokenizer

t = AutoTokenizer.from_pretrained('/models/dbrx-instruct/')
t.encode('请问你是谁')
# [15225, 57107, 57668, 21043, 39013, 223]
t.decode([15225, 57107, 57668, 21043, 39013, 223])
# '请问你是谁'
print(t.convert_ids_to_tokens(15225))
# 'è¯·'
```

I would expect it to output readable token text instead.

dakinggg commented 6 months ago

This is expected; it matches how (e.g.) the GPT2Tokenizer in `transformers` handles this. Not every token id corresponds to a valid UTF-8 sequence on its own, so `convert_ids_to_tokens` uses latin-1 decoding instead.

Illustrative example:

```python
In [13]: tt.encoding.decode_single_token_bytes(15225)
Out[13]: b'\xe8\xaf\xb7'

In [14]: tt.encoding.decode_single_token_bytes(15225).decode('latin-1')
Out[14]: 'è¯·'

In [15]: tt.encoding.decode_single_token_bytes(15225).decode('utf-8')
Out[15]: '请'
```
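A small sketch in plain Python (no tokenizer required) of why latin-1 is the fallback here: a multi-byte UTF-8 character can be split across tokens by byte-level BPE, so a single token's bytes may not be valid UTF-8 on their own, while latin-1 maps every byte 0x00-0xFF to some character and therefore never fails:

```python
# UTF-8 encoding of the character from the issue above
token_bytes = b'\xe8\xaf\xb7'

# A partial UTF-8 sequence (as byte-level BPE can produce) cannot be
# decoded as UTF-8 on its own:
first = token_bytes[:1]
try:
    first.decode('utf-8')
except UnicodeDecodeError:
    print('partial UTF-8 sequence cannot be decoded alone')

# latin-1 always succeeds, mapping one character per byte:
print(first.decode('latin-1'))        # 'è'
print(token_bytes.decode('latin-1'))  # 'è¯·'
print(token_bytes.decode('utf-8'))    # '请'
```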

In general, you cannot assume that `''.join(t.convert_ids_to_tokens([...]))` gives the same result as `t.decode([...])`.
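That said, if the per-token strings really are latin-1 renderings of each token's bytes (as described above), the `decode()` result can be recovered by reversing that mapping rather than by a plain join. A hedged sketch with hypothetical token strings (real ones would come from `t.convert_ids_to_tokens(...)`):

```python
# Hypothetical per-token strings: latin-1 views of the UTF-8 byte
# sequences b'\xe8\xaf\xb7' ('请') and b'\xe9\x97\xae' ('问').
tokens = ['è¯·', 'é\x97®']

joined = ''.join(tokens)  # mojibake, not the original text

# Reverse the latin-1 byte mapping, then decode the bytes as UTF-8:
recovered = joined.encode('latin-1').decode('utf-8')
print(recovered)  # '请问'
```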