Closed: 10zinten closed this PR 1 year ago
Hello @10zinten! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
botok/tokenizers/tokenize.py:
Line 26:5: E303 too many blank lines (2)
Line 28:23: E741 ambiguous variable name 'l'
Line 40:1: W293 blank line contains whitespace
Line 57:80: E501 line too long (82 > 79 characters)
Line 63:80: E501 line too long (80 > 79 characters)
Line 67:5: E303 too many blank lines (2)
Line 302:10: W292 no newline at end of file
botok/tokenizers/wordtokenizer.py:
Line 70:80: E501 line too long (82 > 79 characters)
tests/benchmark.py:
Line 10:80: E501 line too long (101 > 79 characters)
tests/test_bugs.py:
Line 170:6: W292 no newline at end of file
tests/test_tokenize.py:
Line 85:80: E501 line too long (91 > 79 characters)
Line 89:80: E501 line too long (88 > 79 characters)
Line 110:80: E501 line too long (166 > 79 characters)
Line 115:80: E501 line too long (88 > 79 characters)
Line 178:1: E302 expected 2 blank lines, found 1
tests/test_wordtokenizer.py:
Line 136:80: E501 line too long (104 > 79 characters)
Line 144:80: E501 line too long (81 > 79 characters)
Line 152:1: W293 blank line contains whitespace
Line 171:1: W293 blank line contains whitespace
Line 178:80: E501 line too long (97 > 79 characters)
Line 182:80: E501 line too long (84 > 79 characters)
Line 191:80: E501 line too long (94 > 79 characters)
Line 200:1: W293 blank line contains whitespace
Line 205:80: E501 line too long (83 > 79 characters)
Line 222:1: W293 blank line contains whitespace
Line 235:80: E501 line too long (82 > 79 characters)
Line 249:1: E302 expected 2 blank lines, found 1
Line 257:35: W292 no newline at end of file
It looks like it does not parallelize properly.
I'm using the following example from the commit:
from botok import *

profile = "empty"
# build the trie for the chosen profile
main, custom = Config().get_tok_data_paths(profile)
tok = Tokenize(Trie(BoSyl, profile, main, custom))
in_str = "མཐའི་བཀྲ་ཤིས། ཀཀ abc མཐའི་རྒྱ་མཚོ་"
# pre-chunk the input and feed the syllables to the trie
preproc = TokChunks(in_str)
preproc.serve_syls_to_trie()
# run the parallel tokenizer under test
tokens = tok.parallelized_tokenize(preproc)
But I'm replacing in_str with something much larger, like this volume.
The result is that the workload occupies only a single thread and wall-clock time is unaffected.
Interesting, I will look into it as soon as possible.
Do you have any update on this?
Related to this matter, here is a working code example for running Botok in a multiprocessing manner.
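The linked example is not reproduced in this thread, so here is a minimal sketch of the same idea: split the corpus into chunks and tokenize them in separate worker processes with multiprocessing.Pool. The helper names (parallel_tokenize, split_tokenize) are hypothetical, and the botok call mentioned in the docstring is an assumption about its API, not code from the original comment.

```python
# Hypothetical sketch of multiprocessing-based tokenization.
# parallel_tokenize / split_tokenize are illustrative names, not botok APIs.
from multiprocessing import Pool


def split_tokenize(text):
    """Stand-in tokenizer used for illustration.

    In real use this would call botok instead, roughly:
        from botok import WordTokenizer
        return WordTokenizer().tokenize(text)
    (assumed API; in practice the tokenizer should be initialised once
    per worker, since building the trie is expensive).
    """
    return text.split()


def parallel_tokenize(texts, tokenize_fn, workers=2):
    """Tokenize each text chunk in its own worker process."""
    with Pool(processes=workers) as pool:
        # pool.map preserves the order of the input chunks
        return pool.map(tokenize_fn, texts)


if __name__ == "__main__":
    chunks = ["མཐའི་ བཀྲ་ཤིས", "abc def"]
    print(parallel_tokenize(chunks, split_tokenize))
```

Because each worker is a separate process, this sidesteps the GIL and keeps all cores busy, at the cost of per-worker setup and the need for picklable, top-level functions.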
fix #64