aninrusimha closed this issue 5 years ago
Thanks for the issue: I just saw that there are a few chars missing from my new implementation of the unicode table. I will include them as soon as possible.
I also need to update the Readme and the documentation after the big refactoring that ended with the 0.6.1 release (yesterday). For example, you should no longer write t.type != "non-bo", but t.chunk_type != "NON_WORD". All the available variables are now conveniently hard-coded in this file.
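To make the renamed attribute concrete, here is a minimal sketch of the filter written against the new syntax. It does not import pybo: FakeToken is a hypothetical stand-in for a pybo token, and "PLACEHOLDER_OTHER" is an invented value used only for illustration. The only names taken from this thread are chunk_type and "NON_WORD". The old t.pos != "punct" condition is left out because its post-refactoring form is not stated here.

```python
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in for a pybo token (not the real class)."""
    text: str
    chunk_type: str


def keep_words(tokens):
    """Drop tokens flagged as NON_WORD, mirroring the new-style check."""
    return [t for t in tokens if t.chunk_type != "NON_WORD"]


# "PLACEHOLDER_OTHER" is an assumption, not a real pybo chunk type.
sample = [
    FakeToken("kept", "PLACEHOLDER_OTHER"),
    FakeToken("dropped", "NON_WORD"),
]
print([t.text for t in keep_words(sample)])
```

With real pybo tokens the list comprehension should stay the same; only the source of the tokens changes.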
If you're ok with it, you could send me your pybo-related code so I can update it to the new syntax before our documentation is fully updated. If you don't feel like posting it here, send it to me at hhdrupchen@gmail.com
Please check that the missing char now parses as expected, with the 0.6.3 release. Given the changes in my previous msg, you shouldn't have any problem anymore.
solved:
drupchen@drupchen-Inspiron-5558:~$ pybo string "࿖ བཀྲ་ཤིས་བདེ་ལེགས།།"
Loading Trie... (2s.)
࿖_ བཀྲ་ཤིས་ བདེ་ལེགས །།
Thank you!
With pybo 0.4.0 and the BoTokenizer I'm able to tokenize the text that I'm working with. With pybo 0.6.0 and the WordTokenizer I get the following error.
!pip install pybo==0.6.0
import pybo
tok = pybo.WordTokenizer('POS')
...
tokens = [t for t in tok.tokenize(f.read()) if t.type != "non-bo" and t.pos != "punct"]
ValueError: The char "࿖" is expected to be in the tibetan table, but is not.
Attached (four.pdf, six.pdf) are some outputs from Jupyter notebooks, with the pybo version manually changed.
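For anyone hitting the same ValueError, a quick way to see which characters in a text might be involved is to list the code points that fall inside the Unicode Tibetan block (U+0F00 to U+0FFF). This is only a sketch using plain Python: it says nothing about which characters pybo's own table actually contains, and the helper names in_tibetan_block and tibetan_code_points are made up for this example.

```python
def in_tibetan_block(ch: str) -> bool:
    """True if ch lies in the Unicode Tibetan block (U+0F00..U+0FFF)."""
    return 0x0F00 <= ord(ch) <= 0x0FFF


def tibetan_code_points(text: str) -> list:
    """Sorted, de-duplicated code points of Tibetan-block characters in text."""
    return sorted({f"U+{ord(ch):04X}" for ch in text if in_tibetan_block(ch)})


# The character from the error message, followed by an ordinary syllable.
print(tibetan_code_points("࿖ བཀྲ་"))
```

Running this on the input file gives a short list of code points that can be quoted in a bug report when a character is rejected.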