OpenPecha / Botok

🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python
https://botok.readthedocs.io/
Apache License 2.0

Multiprocessing tokenization #70

Closed 10zinten closed 1 year ago

10zinten commented 4 years ago

fix #64

pep8speaks commented 4 years ago

Hello @10zinten! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 26:5: E303 too many blank lines (2)
Line 28:23: E741 ambiguous variable name 'l'
Line 40:1: W293 blank line contains whitespace
Line 57:80: E501 line too long (82 > 79 characters)
Line 63:80: E501 line too long (80 > 79 characters)
Line 67:5: E303 too many blank lines (2)
Line 302:10: W292 no newline at end of file

Line 70:80: E501 line too long (82 > 79 characters)

Line 10:80: E501 line too long (101 > 79 characters)

Line 170:6: W292 no newline at end of file

Line 85:80: E501 line too long (91 > 79 characters)
Line 89:80: E501 line too long (88 > 79 characters)
Line 110:80: E501 line too long (166 > 79 characters)
Line 115:80: E501 line too long (88 > 79 characters)
Line 178:1: E302 expected 2 blank lines, found 1

Line 136:80: E501 line too long (104 > 79 characters)
Line 144:80: E501 line too long (81 > 79 characters)
Line 152:1: W293 blank line contains whitespace
Line 171:1: W293 blank line contains whitespace
Line 178:80: E501 line too long (97 > 79 characters)
Line 182:80: E501 line too long (84 > 79 characters)
Line 191:80: E501 line too long (94 > 79 characters)
Line 200:1: W293 blank line contains whitespace
Line 205:80: E501 line too long (83 > 79 characters)
Line 222:1: W293 blank line contains whitespace
Line 235:80: E501 line too long (82 > 79 characters)
Line 249:1: E302 expected 2 blank lines, found 1
Line 257:35: W292 no newline at end of file

Comment last updated at 2020-07-27 06:08:31 UTC
mikkokotila commented 4 years ago

Looks like it will not parallelize properly.

I'm using the following example from the commit:

```python
from botok import *

# Build a tokenizer from the "empty" profile's trie data.
profile = "empty"
main, custom = Config().get_tok_data_paths(profile)
tok = Tokenize(Trie(BoSyl, profile, main, custom))

# Preprocess the input into syllable chunks and feed them to the trie.
in_str = "མཐའི་བཀྲ་ཤིས། ཀཀ abc མཐའི་རྒྱ་མཚོ་"
preproc = TokChunks(in_str)
preproc.serve_syls_to_trie()

# The new multiprocessing entry point added in this PR.
tokens = tok.parallelized_tokenize(preproc)
```

But instead of the short in_str above, I'm feeding in a much larger input (an entire volume of text).

The result is that the workload occupies only a single thread, and the wall-clock time is unchanged compared to the serial tokenizer.

10zinten commented 4 years ago

Interesting, I will look into it as soon as possible.

mikkokotila commented 4 years ago

Do you have any update on this?

mikkokotila commented 4 years ago

Related to this matter, here is a working code example for running Botok in a multiprocessing manner.
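The original example was not preserved in this thread. As a stand-in, here is a minimal sketch of the usual pattern: split the input into chunks at sentence boundaries and tokenize each chunk in a worker process with `multiprocessing.Pool`. The `tokenize_chunk` and `parallel_tokenize` names are hypothetical, and a trivial whitespace split stands in for botok's tokenizer; in a real setup each worker would construct (or inherit via fork) its own tokenizer, since trie-backed objects are expensive to pickle and ship between processes.

```python
from multiprocessing import Pool


def tokenize_chunk(chunk):
    # Hypothetical stand-in for botok tokenization of one chunk of text.
    # A real worker would call a per-process botok tokenizer here.
    return chunk.split()


def parallel_tokenize(text, n_workers=4):
    # Split on the shad ("།", Tibetan sentence delimiter) so that chunk
    # boundaries fall between tokens rather than inside them.
    chunks = [c for c in text.split("།") if c.strip()]
    with Pool(n_workers) as pool:
        # Each chunk is tokenized in a separate worker process.
        results = pool.map(tokenize_chunk, chunks)
    # Flatten the per-chunk token lists back into a single list,
    # preserving the original chunk order (pool.map keeps ordering).
    return [tok for chunk_toks in results for tok in chunk_toks]


if __name__ == "__main__":
    toks = parallel_tokenize("མཐའི་བཀྲ་ཤིས། ཀཀ abc མཐའི་རྒྱ་མཚོ་")
    print(toks)
```

The key design point is that the split happens before the pool is involved, so whether the work actually spreads across cores depends only on there being enough chunks relative to `n_workers`.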