While experimenting with the ChatTokenizer class, I noticed that the tests do not validate the correctness of the ChatTokenizer.decode function.
To verify this, I added the following lines to the end of test_tokenizer() in test_chat_tokenizers.py:

```python
for text in test_texts:
    my_out = my_tokenizer.decode(my_tokenizer.encode(text))
    auto_out = auto_tokenizer.decode(auto_tokenizer.encode(text))
    auto_out = auto_out[4:]  # skip the "<s> " at the beginning of the decoded string
    assert my_out == auto_out
```
Printing the mismatched outputs gave (ChatTokenizer vs. reference):
```
â vs ☃
ð¤ vs 🤗
â¸(ï½¡Ë áµ Ë )â¸â¡ð ð ð ð ð vs ⸜(。˃ ᵕ ˂ )⸝♡𓆝 𓆟 𓆞 𓆝 𓆟
H₂ + O₂ â 2H₂O vs H₂ + O₂ ⇌ 2H₂O
读万å·书不如行万里路 vs 读万卷书不如行万里路
ç¿も木からè½ちる vs 猿も木から落ちる
```
When I dug into the decode() method of ChatTokenizer, I found that it uses a custom replace_hex() method (tokenizer.py:229) that decodes bytes one by one. In my opinion, this causes the output above: multi-byte UTF-8 characters are decoded byte-wise instead of being decoded as a whole into the expected Unicode character.
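The effect is easy to reproduce in isolation. The snippet below is a minimal sketch of the failure mode, not the project's actual replace_hex() implementation: it contrasts decoding UTF-8 bytes one at a time (each byte mapped to its own code point) with decoding the accumulated byte sequence in one pass.

```python
# The snowman "☃" is three UTF-8 bytes; the hugging face "🤗" is four.
for text in ("☃", "🤗"):
    data = text.encode("utf-8")  # e.g. b'\xe2\x98\x83' for "☃"

    # Byte-wise decoding: every byte becomes its own character
    # (latin-1 maps each byte value to the same code point).
    # For "☃" this yields 'â' plus two invisible control characters,
    # matching the garbled "â vs ☃" shown above.
    bytewise = "".join(bytes([b]).decode("latin-1") for b in data)

    # Decoding the whole sequence at once restores the character.
    whole = data.decode("utf-8")

    print(repr(bytewise), "vs", whole)
```

This is why the garbage only appears for non-ASCII text: ASCII characters are single-byte in UTF-8, so byte-wise decoding happens to produce the right result for them.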
I have a fix locally and will open a PR for this from my fork.