OpenPecha / Botok

🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python
https://botok.readthedocs.io/
Apache License 2.0

fix(resources): Create bo_punct_position.csv #95

Closed ngawangtrinley closed 1 year ago

ngawangtrinley commented 1 year ago

KNOWN LIMITATION: there are exceptions to the above punctuation group rules, such as ། ༈ །, which contains both opening and closing punctuation characters and needs to be split into ། ། and , respectively a closing and an opening punctuation group. However, these cases are not very common and are ignored for now.
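
A minimal sketch of how such mixed groups could at least be detected and flagged, even while they are not split. It assumes ༈ (U+0F08) counts as opening and ། (U+0F0D) as closing; the `OPENING` and `CLOSING` sets and the `is_mixed_group` helper are hypothetical names for illustration, not part of Botok's API or of bo_punct_position.csv.

```python
# Hypothetical position sets; real values would come from the punctuation
# position resource, which is not reproduced here.
OPENING = {"\u0F08"}  # ༈ assumed to be an opening mark
CLOSING = {"\u0F0D"}  # ། assumed to be a closing mark


def is_mixed_group(group: str) -> bool:
    """Return True if a punctuation group holds both opening and closing marks."""
    # Ignore whitespace between the marks of the group.
    chars = {c for c in group if not c.isspace()}
    return bool(chars & OPENING) and bool(chars & CLOSING)


if __name__ == "__main__":
    # The group cited in the limitation mixes both kinds of marks.
    print(is_mixed_group("\u0F0D \u0F08 \u0F0D"))
    # A group made only of closing marks is not mixed.
    print(is_mixed_group("\u0F0D \u0F0D"))
```

A check like this only reports the exceptional groups; deciding where to split them is left open, as in the limitation above.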

eroux commented 1 year ago

just for reference, I have some code from a few years ago that handles punctuation: https://github.com/buda-base/git-to-dbs/blob/master/src/main/java/io/bdrc/gittodbs/TibetanStringChunker.java