Closed BlinkDL closed 1 year ago
Hi, your work looks great. Is it doing greedy tokenization, in the sense of always picking the longest possible token?
Here's my multilingual greedy tokenization experiment, FYI: https://github.com/BlinkDL/ChatRWKV/blob/main/tokenizer/rwkv_tokenizer.py
Yes, it's a greedy tokenizer. I'm implementing a non-greedy version at the moment. Multilingual text will work out of the box; I just need decent datasets.
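For readers unfamiliar with the term, here is a minimal sketch of what "greedy" means in this thread: at each position, take the longest vocabulary entry that matches. This is not the code from either repository. The function name `greedy_tokenize`, the toy vocabulary, and the single-character fallback are all assumptions made for illustration.

```python
def greedy_tokenize(text, vocab):
    """Split text by always taking the longest vocab entry at the current position.

    Illustrative sketch only: `vocab` is a hypothetical, non-empty set of strings.
    """
    max_len = max(len(tok) for tok in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for end in range(min(len(text), i + max_len), i, -1):
            piece = text[i:end]
            if piece in vocab:
                tokens.append(piece)
                i = end
                break
        else:
            # Assumed fallback: no vocab entry matches, so emit one character.
            tokens.append(text[i])
            i += 1
    return tokens


# Toy vocabulary, made up for this example.
vocab = {"a", "b", "c", "d", "ab", "bc", "cd", "abc"}
print(greedy_tokenize("abcd", vocab))  # ['abc', 'd']
```

With this toy vocabulary the greedy pass commits to `abc` and is left with `d`, whereas a non-greedy tokenizer would be free to consider the split `ab` + `cd` as well. Scanning candidate lengths like this is the simplest way to show the idea; a trie would likely be faster on a real vocabulary.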