OpenPecha / Botok

🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python
https://botok.readthedocs.io/
Apache License 2.0

Splitting མངས་བས་ wrong? #104

Open lothelanor opened 1 year ago

lothelanor commented 1 year ago

མངས་བས་ should be split as མངས་བ/n.v.past + ས་/case.agn (POS tags added for illustration), but botok splits it as མང + ས་བ + ས་, which seems odd. ས་བ does exist as a noun, of course, but isn't the མངས་བ + ས་ reading more common?

The same thing seems to happen with དམ་བཅས་པ་.
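
To make the difference concrete, here is a small sketch that compares the two segmentations of the first example by their cut positions. It does not call botok; the `observed` list is copied by hand from the output reported above, and `boundaries` is a hypothetical helper written only for this comparison.

```python
def boundaries(segments):
    """Return the internal cut offsets (in code points) of a segmentation."""
    cuts, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        cuts.add(pos)
    return cuts


text = "མངས་བས་"
expected = ["མངས་བ", "ས་"]        # the split proposed in this issue
observed = ["མང", "ས་བ", "ས་"]    # the split reported from botok, copied by hand

# Both segmentations must cover exactly the same string.
assert "".join(expected) == text
assert "".join(observed) == text

# Cuts botok makes that the expected split does not, and vice versa.
extra = sorted(boundaries(observed) - boundaries(expected))
missing = sorted(boundaries(expected) - boundaries(observed))
print("extra cuts:", extra)
print("missing cuts:", missing)
```

The two splits agree on the cut before the final ས་; the only difference is one extra cut at offset 2, which falls inside the syllable མངས rather than at a tsek. To reproduce against botok directly, `observed` could presumably be built with `[t.text for t in WordTokenizer().tokenize(text)]`, assuming the current `WordTokenizer` API.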