databricks / dbrx

Code examples and resources for DBRX, a large language model developed by Databricks
https://www.databricks.com/

`convert_ids_to_tokens` not working as expected. #21

Closed jcao-ai closed 6 months ago

jcao-ai commented 6 months ago
```python
from transformers import AutoTokenizer

t = AutoTokenizer.from_pretrained('/models/dbrx-instruct/')
t.encode('请问你是谁')
# [15225, 57107, 57668, 21043, 39013, 223]
t.decode([15225, 57107, 57668, 21043, 39013, 223])
# '请问你是谁'
print(t.convert_ids_to_tokens(15225))
# 'è¯·'
```

I would expect it to output readable token text instead.

dakinggg commented 6 months ago

This is expected; it matches how (e.g.) the GPT2Tokenizer in `transformers` handles this. Not every token id corresponds to a valid UTF-8 sequence on its own, so `convert_ids_to_tokens` uses latin-1 decoding instead.

Illustrative example:

```python
In [13]: tt.encoding.decode_single_token_bytes(15225)
Out[13]: b'\xe8\xaf\xb7'

In [14]: tt.encoding.decode_single_token_bytes(15225).decode('latin-1')
Out[14]: 'è¯·'

In [15]: tt.encoding.decode_single_token_bytes(15225).decode('utf-8')
Out[15]: '请'
```
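A small sketch in plain Python (no tokenizer required) of why latin-1 is the fallback here: a multi-byte UTF-8 character can be split across tokens by byte-level BPE, so a single token's bytes may not be valid UTF-8 on their own, while latin-1 maps every byte 0x00-0xFF to some character and therefore never fails:

```python
# UTF-8 encoding of the character from the issue above
token_bytes = b'\xe8\xaf\xb7'

# A partial UTF-8 sequence (as byte-level BPE can produce) cannot be
# decoded as UTF-8 on its own:
first = token_bytes[:1]
try:
    first.decode('utf-8')
except UnicodeDecodeError:
    print('partial UTF-8 sequence cannot be decoded alone')

# latin-1 always succeeds, mapping one character per byte:
print(first.decode('latin-1'))        # 'è'
print(token_bytes.decode('latin-1'))  # 'è¯·'
print(token_bytes.decode('utf-8'))    # '请'
```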

In general, you cannot assume that `''.join(t.convert_ids_to_tokens([...]))` gives the same result as `t.decode([...])`.
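That said, if the per-token strings really are latin-1 renderings of each token's bytes (as described above), the `decode()` result can be recovered by reversing that mapping rather than by a plain join. A hedged sketch with hypothetical token strings (real ones would come from `t.convert_ids_to_tokens(...)`):

```python
# Hypothetical per-token strings: latin-1 views of the UTF-8 byte
# sequences b'\xe8\xaf\xb7' ('请') and b'\xe9\x97\xae' ('问').
tokens = ['è¯·', 'é\x97®']

joined = ''.join(tokens)  # mojibake, not the original text

# Reverse the latin-1 byte mapping, then decode the bytes as UTF-8:
recovered = joined.encode('latin-1').decode('utf-8')
print(recovered)  # '请问'
```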