alasdairforsythe / tokenmonster

Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript
MIT License

This is great. Can we build a multilang tokenizer? #1

Closed BlinkDL closed 1 year ago

BlinkDL commented 1 year ago

Hi, your work looks great. Is it doing greedy tokenization, in the sense that it always picks the longest possible token?

Here's my multilang greedy tokenization experiment FYI: https://github.com/BlinkDL/ChatRWKV/blob/main/tokenizer/rwkv_tokenizer.py

alasdairforsythe commented 1 year ago

Yes, it's a greedy tokenizer. I'm implementing an ungreedy version at the moment. For multiple languages it will work out of the box; I just need decent datasets.
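Greedy (longest-match-first) tokenization as discussed above can be sketched as follows. This is an illustration only, not TokenMonster's or RWKV's actual implementation, and the toy vocabulary is made up:

```python
def greedy_tokenize(text, vocab):
    """At each position, emit the longest vocab entry that matches."""
    tokens = []
    i = 0
    max_len = max(len(t) for t in vocab)
    while i < len(text):
        # Try the longest possible match first, shrinking by one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # Fall back to a single character if nothing matches.
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"to", "ken", "token", "monster", "izer"}
print(greedy_tokenize("tokenmonster", vocab))  # ['token', 'monster']
```

Note the pitfall an ungreedy tokenizer addresses: always taking the longest match at the current position can force a worse segmentation later in the string, so greedy output is not guaranteed to minimize the total token count.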