Is tokenization_rwkv5.py equivalent to tokenization_rwkv_world.py from huggingface?

shiroko98 commented 2 months ago

Is tokenization_rwkv5.py equivalent to tokenization_rwkv_world.py from https://huggingface.co/RWKV/v5-Eagle-7B-HF/tree/main? I saw that WordpieceTokenizer from tokenization_rwkv5.py uses whitespace_tokenize to split tokens, which seems to be unfriendly to Chinese characters.

shiroko98 commented 2 months ago

I used the above two tokenizers to convert model respectively. The test results are as follows. The first one using tokenization_rwkv_world, and the second one using tokenization_rwkv5.

User: hi

Assistant: Hi. I am your assistant and I will provide expert full response in full details. Please feel free to ask any question and I will always answer it.

User: Du Fu is best known for his lǜshi

Assistant: Du Fu is indeed known for his lǜshi, which refers to his poetry. He was a prominent poet during the Tang Dynasty (618-907) and is considered one of the greatest poets in Chinese history. His lǜshi is characterized by its vivid imagery, emotional depth, and insightful commentary on the human condition. Some of his most famous lǜshi include "Ancient Time," "On the Origin of the World," and "The River-like Heart." His lǜshi is admired for its profound insights into the human experience and its ability to capture the essence of life.

User: hi

Assistant: Hi. I am your assistant and I will provide expert full response in full details. Please feel free to ask any question and I will always answer it.

User: Du Fu is best known for his l��shi

Assistant: Du Fu is indeed best known for his l��shi, which is a type of poem that combines the traditional Chinese verse structure with a more free-flowing style. His l��shi poems often explore themes of love, nature, and the human condition, and are characterized by their vivid imagery and emotional depth. Some of his most famous l��shi poems include "Ancient Time" and "On the Origin of Love."

User: Du Fu's l��shi poems are often compared to the l��shi poems of Li Bai

That indicates the greedy longest-match-first algorithm in the tokenize function of WordpieceTokenizer class may need to be rechecked.

BBuf commented 1 month ago

Use

Is tokenization_rwkv5.py equivalent to tokenization_rwkv_world.py from https://huggingface.co/RWKV/v5-Eagle-7B-HF/tree/main? I saw that WordpieceTokenizer from tokenization_rwkv5.py uses whitespace_tokenize to split tokens, which seems to be unfriendly to Chinese characters.

Not equivalent, please use tokenization_rwkv5.py.

BBuf / RWKV-World-HF-Tokenizer

Is tokenization_rwkv5.py equivalent to tokenization_rwkv_world.py from huggingface? #5