drupchen closed this pull request 4 years ago
Hello @drupchen! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:
tests/test_trie.py
Line 25:80: E501 line too long (82 > 79 characters)
Line 30:80: E501 line too long (133 > 79 characters)
Line 33:80: E501 line too long (181 > 79 characters)
Line 36:80: E501 line too long (169 > 79 characters)
Line 41:80: E501 line too long (91 > 79 characters)
Merged here: e27e4b88bc68491a00ac899014992e7b80de71b3
The aim of this PR is to:
A. solve the memory cost of tries (see https://github.com/Esukhia/pybo/issues/56)
B. drastically simplify `Tokenize.tokenize()` so it becomes maintainable.
A. is solved in 071bdd2f6122ce49f7e4a5c4e67514074719edc5 (see results in https://github.com/Esukhia/pybo/issues/56#issuecomment-522363715)
B. @10zinten, @ngawangtrinley has agreed that you will implement it after finishing your work on OpenPoti.
It will be easier to implement the maximal matching algorithm from scratch. For each token found (word or non-word), you can use `add_found_word_or_non_word()` to generate the `Token` objects. (See here for an example.) You can use this failing test as a basis for implementing max-match. Also, ensure that the tests in `test_tokenize.py`, `test_bugs.py` and `test_bugs_missing_tokens.py` all pass. Don't hesitate to reach out with any questions.
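For reference, here is a minimal sketch of the maximal matching ("max-match") idea over a plain dict-based trie. The trie layout, the helper names (`build_trie`, `max_match`) and the `(text, kind)` tuple output are all illustrative assumptions, not pybo's actual API; the real implementation would call `add_found_word_or_non_word()` instead of collecting tuples.

```python
def build_trie(words):
    """Build a nested-dict trie; the '$' key marks the end of a word.
    (Hypothetical helper for illustration, not pybo's trie class.)"""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return trie

def max_match(text, trie):
    """Greedily take the longest dictionary match at each position.
    A character that starts no dictionary word becomes a one-char
    non-word token, so the whole input is always covered."""
    tokens = []
    i = 0
    while i < len(text):
        node = trie
        longest = 0  # length of the longest match starting at position i
        j = i
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if "$" in node:
                longest = j - i
        if longest:
            tokens.append((text[i:i + longest], "word"))
            i += longest
        else:
            tokens.append((text[i], "non-word"))
            i += 1
    return tokens

trie = build_trie(["the", "there", "cat"])
print(max_match("therecatx", trie))
# → [('there', 'word'), ('cat', 'word'), ('x', 'non-word')]
```

Note that "there" wins over the shorter prefix "the" because the inner loop keeps walking the trie and only records the last position where an end-of-word marker was seen.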