OpenPecha / Botok

🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python
https://botok.readthedocs.io/
Apache License 2.0

Missing character when updating from pybo 0.4.0 to pybo 0.6.0, BoTokenizer to WordTokenizer #52

Closed. aninrusimha closed this issue 5 years ago

aninrusimha commented 5 years ago

With pybo 0.4.0 and the BoTokenizer I'm able to tokenize the text that I'm working with. With pybo 0.6.0 and the WordTokenizer I get the following error.

!pip install pybo==0.6.0
tok = pybo.WordTokenizer('POS')
...
tokens = [t for t in tok.tokenize(f.read()) if t.type != "non-bo" and t.pos != "punct"]

ValueError: The char "࿖" is expected to be in the tibetan table, but is not.

Attached (four.pdf, six.pdf) are outputs from Jupyter notebooks, with the pybo version manually changed between runs.
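
For an error like the one above, it can help to list which characters from the Unicode Tibetan block (U+0F00 to U+0FFF) actually occur in the input before tokenizing. The following is a minimal standard-library sketch, not part of pybo: the helper name `tibetan_block_chars` is hypothetical, and it only reports code points and Unicode names. It has no knowledge of which characters pybo's own table accepts.

```python
import unicodedata

# Unicode Tibetan block: U+0F00..U+0FFF
TIBETAN_START, TIBETAN_END = 0x0F00, 0x0FFF


def tibetan_block_chars(text):
    """Map each distinct Tibetan-block character in `text` to (code point, name).

    Hypothetical diagnostic helper; independent of pybo.
    Characters are returned in order of first appearance.
    """
    found = {}
    for ch in text:
        if TIBETAN_START <= ord(ch) <= TIBETAN_END and ch not in found:
            # unicodedata.name raises ValueError for unnamed code points
            # unless a default is supplied.
            found[ch] = (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
    return found


if __name__ == "__main__":
    # Sample input written with escapes so the code points are unambiguous.
    sample = "\u0fd6 \u0f56\u0f40\u0fb2\u0f0b\u0f64\u0f72\u0f66\u0f0b"
    for ch, (code, name) in tibetan_block_chars(sample).items():
        print(code, name)
```

Running this over the file that triggers the `ValueError` gives a short list of code points to compare against the characters the tokenizer rejects.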

drupchen commented 5 years ago

Thanks for the issue: I just saw that there are a few chars missing from my new implementation of the unicode table. I will include them as soon as possible.

I also need to update the Readme and the documentation after the big refactoring that ended with the 0.6.1 release (yesterday). For example, you should no longer write t.type != "non-bo", but t.chunk_type != "NON_WORD" instead. All the available values are now conveniently hard-coded in this file.
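
The attribute rename described above can be sketched without installing pybo. In this sketch, `Token` is a hypothetical stand-in and not pybo's own token class; only the attribute name `chunk_type` and the value "NON_WORD" come from this thread. The `pos != "punct"` check is carried over unchanged from the original snippet, and whether that value is still valid after the refactoring is an assumption. The other attribute values below ("TEXT", "noun", "x") are placeholders.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical stand-in for a tokenizer token; not pybo's class."""
    text: str
    chunk_type: str
    pos: str


def keep_words(tokens):
    """Drop non-word chunks and punctuation, using the renamed attribute."""
    return [
        t for t in tokens
        # Before the refactoring this read: t.type != "non-bo"
        if t.chunk_type != "NON_WORD" and t.pos != "punct"
    ]


if __name__ == "__main__":
    sample = [
        Token("a", "TEXT", "noun"),    # kept
        Token("b", "NON_WORD", "x"),   # dropped: non-word chunk
        Token("c", "TEXT", "punct"),   # dropped: punctuation
    ]
    print([t.text for t in keep_words(sample)])
```

With the real library, the same list comprehension would be applied to the output of `tok.tokenize(...)` in place of the stand-in tokens.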

If you're OK with it, you could send me your pybo-related code so I can update it to the new syntax before our documentation is fully updated. If you don't feel like posting it here, send it to me at hhdrupchen@gmail.com

drupchen commented 5 years ago

Please check that the missing char now parses as expected with the 0.6.3 release. With the changes described in my previous message, you shouldn't have any problems anymore.

drupchen commented 5 years ago

solved:

drupchen@drupchen-Inspiron-5558:~$ pybo string "࿖ བཀྲ་ཤིས་བདེ་ལེགས།།"
Loading Trie... (2s.)
࿖_ བཀྲ་ཤིས་ བདེ་ལེགས །།

aninrusimha commented 5 years ago

Thank you!