BlinkDL / RWKV-LM

RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable), so it combines the best of RNNs and transformers: great performance, fast inference, VRAM savings, fast training, "infinite" ctx_len, and free sentence embeddings.
Apache License 2.0

Suggest using a more space-efficient tokenizer for Chinese #108

Closed revive closed 1 year ago

revive commented 1 year ago

The default tokenizer used by the current RWKV pre-trained models is based on GPT-NeoX, which is not efficient for Chinese text.

For example, tokenizing the sentence "我喜欢敦煌,那里有湛蓝的天空和厚重的历史。", which is 21 characters long, produces a list of 33 tokens.

I suggest using the tokenizer from BLOOM. Encoding the same sentence with the BLOOM tokenizer yields only 14 tokens, so each token covers roughly twice as many Chinese characters as with the default tokenizer.
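For reference, the gap is easy to reproduce with the Hugging Face `transformers` tokenizers (a minimal sketch; exact counts can drift slightly across tokenizer versions):

```python
# Sketch: compare token counts of the GPT-NeoX and BLOOM tokenizers on a
# Chinese sentence. Assumes the `transformers` package and hub access.
from transformers import AutoTokenizer

sentence = "我喜欢敦煌,那里有湛蓝的天空和厚重的历史。"

neox = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
bloom = AutoTokenizer.from_pretrained("bigscience/bloom")

print("characters:     ", len(sentence))                 # 21 characters
print("gpt-neox tokens:", len(neox.encode(sentence)))    # ~33 tokens
print("bloom tokens:   ", len(bloom.encode(sentence)))   # ~14 tokens
```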

The downside is that the model would have to be pre-trained from scratch.

BlinkDL commented 1 year ago

Yeah please see https://twitter.com/BlinkDL_AI/status/1649839897208045573

gg22mm commented 1 year ago

The link won't open for me, and I can't get around the Great Firewall. Is there any other explanation?

revive commented 1 year ago

The link won't open for me, and I can't get around the Great Firewall. Is there any other explanation?

It is an announcement of the new tokenizer. https://github.com/BlinkDL/ChatRWKV/tree/main/tokenizer
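For anyone who cannot reach the tweet: the linked directory contains a trie-based tokenizer with a new multilingual vocabulary. A rough usage sketch, assuming the `rwkv_tokenizer.py` and `rwkv_vocab_v20230424.txt` file names in that directory (verify against the current tree before relying on them):

```python
# Rough usage sketch for the tokenizer announced above. File and class
# names are assumptions taken from the linked ChatRWKV tokenizer directory.
from rwkv_tokenizer import TRIE_TOKENIZER

tokenizer = TRIE_TOKENIZER("rwkv_vocab_v20230424.txt")

sentence = "我喜欢敦煌,那里有湛蓝的天空和厚重的历史。"
tokens = tokenizer.encode(sentence)
print(len(tokens), tokens)                    # fewer tokens than GPT-NeoX for Chinese text
assert tokenizer.decode(tokens) == sentence   # encoding round-trips losslessly
```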