99991 / SimpleTinyLlama

https://github.com/jzhang38/TinyLlama using only PyTorch
Apache License 2.0

Tokenizer.decode not decoding unicode correctly #1

Closed SebastianSchiefer closed 7 months ago

SebastianSchiefer commented 7 months ago

While experimenting with the ChatTokenizer class, I noticed that your tests do not validate the correctness of the ChatTokenizer.decode function.

To test this myself, I added the following lines to the end of test_tokenizer() in test_chat_tokenizers.py:

    for text in test_texts:
        my_out = my_tokenizer.decode(my_tokenizer.encode(text))
        auto_out = auto_tokenizer.decode(auto_tokenizer.encode(text))
        auto_out = auto_out[4:]  # skip the "<s> " at the beginning of the decoded string
        if my_out != auto_out:
            print(my_out, "vs", auto_out)
            assert False

Printing the misaligned outputs gave:

    ☃ vs ☃
    🤗 vs 🤗
    ⸜(。˃ ᵕ ˂ )⸝♡𓆝 𓆟 𓆞 𓆝 𓆟 vs ⸜(。˃ ᵕ ˂ )⸝♡𓆝 𓆟 𓆞 𓆝 𓆟
    H₂ + O₂ ⇌ 2H₂O vs H₂ + O₂ ⇌ 2H₂O
    读万卷书不如行万里路 vs 读万卷书不如行万里路
    猿も木から落ちる vs 猿も木から落ちる

When I dug into the decode() method of ChatTokenizer, I found that it uses a custom replace_hex() method in tokenizer.py:229, which decodes bytes one by one. I believe this causes the output above: multi-byte Unicode characters are decoded bytewise instead of being decoded as a single sequence into the expected character.
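To illustrate the failure mode in isolation (a minimal sketch, independent of the repository code):

```python
# "☃" (U+2603) is encoded as three bytes in UTF-8.
snowman = "☃".encode("utf-8")
assert snowman == b"\xe2\x98\x83"

# Turning each byte into its own character (the bytewise approach)
# yields three separate characters instead of the original one:
bytewise = "".join(chr(b) for b in snowman)
print(repr(bytewise))  # 'â\x98\x83' — not '☃'

# Decoding the whole byte sequence at once restores the character:
print(snowman.decode("utf-8"))  # ☃
```

Any character outside the ASCII range is affected the same way, which matches the examples above.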

I have a fix locally and will open a PR for it from my fork.
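For reference, one possible approach is to collect consecutive byte tokens into a single buffer and decode them together. This is only a sketch of the idea, not the actual PR; the function name `replace_hex_sequences` and the `<0xNN>` token format are my assumptions, based on how LLaMA-style SentencePiece tokenizers render raw bytes:

```python
import re

def replace_hex_sequences(text):
    # Match one or more consecutive byte tokens like "<0xE2><0x98><0x83>"
    # and decode the whole run as a single UTF-8 byte sequence, so that
    # multi-byte characters survive intact.
    def decode_match(match):
        hex_values = match.group(0)[3:-1].split("><0x")
        raw = bytes(int(h, 16) for h in hex_values)
        return raw.decode("utf-8", errors="replace")

    return re.sub(r"(?:<0x[0-9A-Fa-f]{2}>)+", decode_match, text)

print(replace_hex_sequences("snowman: <0xE2><0x98><0x83>"))  # snowman: ☃
```

Decoding the buffered bytes in one call (rather than per byte) is what makes the difference here.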

99991 commented 7 months ago

Thank you for the PR!