OpenPecha / Botok

🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python
https://botok.readthedocs.io/
Apache License 2.0

Tokenizer improvement #58

Closed · drupchen closed 4 years ago

drupchen commented 5 years ago

The aim of this PR is to:

A. solve the memory cost of the tries (see https://github.com/Esukhia/pybo/issues/56)
B. drastically simplify Tokenize.tokenize() so that it becomes maintainable.

A. is solved in commit 071bdd2f6122ce49f7e4a5c4e67514074719edc5 (see the results in https://github.com/Esukhia/pybo/issues/56#issuecomment-522363715)

B. @10zinten: @ngawangtrinley has agreed that you will implement it after finishing your work on OpenPoti.

It will be easier to implement the maximal matching algorithm from scratch. For each token found (word or non-word), you can use add_found_word_or_non_word() to generate the Token objects (see here for an example).
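
For illustration only, here is a minimal, self-contained sketch of maximal (longest-first) matching over a toy lexicon. The plain Python set standing in for the trie, the max_len window, and the (syllables, is_word) output shape are assumptions made for the example; in botok itself each match would instead be handed to add_found_word_or_non_word() to build the Token objects, as described above.

```python
def max_match(syllables, lexicon, max_len=4):
    """Greedily take the longest lexicon entry starting at each position;
    anything not matched is emitted as a one-syllable non-word."""
    tokens = []
    i = 0
    while i < len(syllables):
        found = None
        # Try the longest candidate window first, then shrink it.
        for j in range(min(i + max_len, len(syllables)), i, -1):
            candidate = tuple(syllables[i:j])
            if candidate in lexicon:
                found = (candidate, True)   # word
                i = j
                break
        if found is None:
            found = ((syllables[i],), False)  # non-word fallback
            i += 1
        tokens.append(found)
    return tokens


if __name__ == "__main__":
    # Toy lexicon invented for the example.
    lexicon = {("བཀྲ་", "ཤིས་"), ("བདེ་", "ལེགས་")}
    print(max_match(["བཀྲ་", "ཤིས་", "བདེ་", "ལེགས་", "ཀཀ་"], lexicon))
    # [(('བཀྲ་', 'ཤིས་'), True), (('བདེ་', 'ལེགས་'), True), (('ཀཀ་',), False)]
```

Trying the longest window first and shrinking it is what gives the greedy longest-match behaviour, and the one-syllable non-word fallback ensures the loop always makes progress.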

You can use this failing test as a basis for implementing the max-match. Also ensure that the tests in test_tokenize.py, test_bugs.py and test_bugs_missing_tokens.py all pass.

Don't hesitate to reach out with any questions.

pep8speaks commented 5 years ago

Hello @drupchen! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 25:80: E501 line too long (82 > 79 characters)
Line 30:80: E501 line too long (133 > 79 characters)
Line 33:80: E501 line too long (181 > 79 characters)
Line 36:80: E501 line too long (169 > 79 characters)
Line 41:80: E501 line too long (91 > 79 characters)

10zinten commented 4 years ago

Merged here: e27e4b88bc68491a00ac899014992e7b80de71b3