alasdairforsythe / tokenmonster

Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript
MIT License

This is great. Can we build a multilang tokenizer? #1

Closed BlinkDL closed 1 year ago

BlinkDL commented 1 year ago

Hi, your work looks great. Is it doing greedy tokenization, in the sense that it always picks the longest possible token?

Here's my multilang greedy tokenization experiment FYI: https://github.com/BlinkDL/ChatRWKV/blob/main/tokenizer/rwkv_tokenizer.py

alasdairforsythe commented 1 year ago

Yes, it's a greedy tokenizer. I'm implementing an ungreedy version at the moment. For multiple languages it will work out of the box; I just need decent datasets.
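Greedy (longest-match-first) tokenization as discussed above can be sketched as follows. This is an illustration only, not TokenMonster's or RWKV's actual implementation, and the toy vocabulary is made up:

```python
def greedy_tokenize(text, vocab):
    """At each position, emit the longest vocab entry that matches."""
    tokens = []
    i = 0
    max_len = max(len(t) for t in vocab)
    while i < len(text):
        # Try the longest possible match first, shrinking by one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # Fall back to a single character if nothing matches.
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"to", "ken", "token", "monster", "izer"}
print(greedy_tokenize("tokenmonster", vocab))  # ['token', 'monster']
```

Note the pitfall an ungreedy tokenizer addresses: always taking the longest match at the current position can force a worse segmentation later in the string, so greedy output is not guaranteed to minimize the total token count.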