99991 / SimpleTinyLlama

https://github.com/jzhang38/TinyLlama using only PyTorch
Apache License 2.0

Tokenizer.decode not decoding unicode correctly #1

Closed SebastianSchiefer closed 7 months ago

SebastianSchiefer commented 7 months ago

While experimenting with the ChatTokenizer class, I noticed that your tests do not validate the correctness of the ChatTokenizer.decode function.

To test this myself, I added the following lines to the end of test_tokenizer() in test_chat_tokenizers.py:

    for text in test_texts:
        my_out = my_tokenizer.decode(my_tokenizer.encode(text))
        auto_out = auto_tokenizer.decode(auto_tokenizer.encode(text))
        auto_out = auto_out[4:]  # skip the "<s> " at the beginning of the decoded string
        if my_out != auto_out:
            print(my_out, "vs", auto_out)
            assert False

Printing the misaligned outputs gave:

    ☃ vs ☃
    🤗 vs 🤗
    ⸜(。˃ ᵕ ˂ )⸝♡𓆝 𓆟 𓆞 𓆝 𓆟 vs ⸜(。˃ ᵕ ˂ )⸝♡𓆝 𓆟 𓆞 𓆝 𓆟
    H₂ + O₂ ⇌ 2H₂O vs H₂ + O₂ ⇌ 2H₂O
    读万卷书不如行万里路 vs 读万卷书不如行万里路
    猿も木から落ちる vs 猿も木から落ちる

When I dug into the decode() method of ChatTokenizer, I found that it uses a custom replace_hex() method in tokenizer.py:229, which decodes bytes one by one. I believe this causes the output above: multi-byte Unicode characters are decoded bytewise instead of being decoded as a single sequence into the expected character.
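To illustrate the failure mode in isolation (a minimal sketch, independent of the repository code):

```python
# "☃" (U+2603) is encoded as three bytes in UTF-8.
snowman = "☃".encode("utf-8")
assert snowman == b"\xe2\x98\x83"

# Turning each byte into its own character (the bytewise approach)
# yields three separate characters instead of the original one:
bytewise = "".join(chr(b) for b in snowman)
print(repr(bytewise))  # 'â\x98\x83' — not '☃'

# Decoding the whole byte sequence at once restores the character:
print(snowman.decode("utf-8"))  # ☃
```

Any character outside the ASCII range is affected the same way, which matches the examples above.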

I have a fix locally and will open a PR for it from my fork.
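For reference, one possible approach is to collect consecutive byte tokens into a single buffer and decode them together. This is only a sketch of the idea, not the actual PR; the function name `replace_hex_sequences` and the `<0xNN>` token format are my assumptions, based on how LLaMA-style SentencePiece tokenizers render raw bytes:

```python
import re

def replace_hex_sequences(text):
    # Match one or more consecutive byte tokens like "<0xE2><0x98><0x83>"
    # and decode the whole run as a single UTF-8 byte sequence, so that
    # multi-byte characters survive intact.
    def decode_match(match):
        hex_values = match.group(0)[3:-1].split("><0x")
        raw = bytes(int(h, 16) for h in hex_values)
        return raw.decode("utf-8", errors="replace")

    return re.sub(r"(?:<0x[0-9A-Fa-f]{2}>)+", decode_match, text)

print(replace_hex_sequences("snowman: <0xE2><0x98><0x83>"))  # snowman: ☃
```

Decoding the buffered bytes in one call (rather than per byte) is what makes the difference here.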

99991 commented 7 months ago

Thank you for the PR!