Change from a higher level interface to a lower one because:
As described in https://github.com/python/cpython/issues/105829, when there are many tasks submitted to a concurrent.futures.ProcessPoolExecutor pool, there is probability that deadlocks will occur with CPython. The same example run with multiprocessing.pool.Pool had no such problem.
The issue has been fixed in Python>=3.11.6 and >=1.12.1. For the sake of compatibility, I rewrote the parallel tokenization using multiprocessing.
Trying to find out the best num_processes
I tested tokenization with various num_processes (using RegexTokenization):
linux, fork
num_processes
AmazonCat-13K
EUR-Lex
Wiki10-31K
no parallel
108.49 s
6.66 s
10.31 s
2
92.31 s
5.81 s
9.92 s
4
61.62 s
3.80 s
6.01 s
8
59.44 s
3.20 s
4.65 s
16
49.05 s
3.90 s
5.72 s
32
52.18 s
3.44 s
4.96 s
64
57.61 s
5.37 s
7.09 s
128
86.96 s
8.54 s
11.16 s
256
119.51 s
14.57 s
19.10 s
Adds-on: I re-ran the codes again. This time 16 had the best perfomance in all cases.
num_processes
AmazonCat-13K
EUR-Lex
Wiki10-31K
no parallel
97.36 s
6.50 s
10.57 s
8
62.09 s
4.09 s
6.29 s
16
42.34 s
3.59 s
5.67 s
32
69.60 s
4.72 s
6.70 s
Based on the results, I believe 16 is a reasonable choice for num_processes. For small datasets, a difference of 1 to 2 seconds is negligible. For large datasets like AmazonCat-13K, 16 has the least running time than other settings.
Having said that, the results are device- and system-specific. This means the choice for num_processes might be different, for example, on intel CPU or on Windows (I'm using AMD server CPU and Linux).
I tested multiprocessing on Windows. Since on Windows and MacOS doesn't has "fork" as start method, the running is longer using "spawn" as start method (spawn takes more time to start than fork).
win32, spawn
num_processes
EUR-Lex
Wiki10-31K
no parallel
6.02 s
15.61 s
2
14.85 s
23.68 s
4
17.57 s
19.42 s
8
20.08 s
23.44 s
12
26.21 s
35.86 s
I also tested spawn on linux
linux, spawn
num_processes
EUR-Lex
Wiki10-31K
no parallel
6.61 s
10.64 s
2
8.41 s
12.67 s
4
6.07 s
8.60 s
8
5.88 s
8.65 s
12
5.36 s
6.81 s
16
5.42 s
6.85 s
32
6.36 s
8.86 s
It turned out the support for multiprocessing is more complicated than I think. So I'll limited the use of multiprocessing on Linux only.
Test CLI & API (bash tests/autotest.sh)
Test APIs used by main.py.
[ ] Test Pass
(Copy and paste the last outputted line here.)
[ ] Not Applicable (i.e., the PR does not include API changes.)
Check API Document
If any new APIs are added, please check if the description of the APIs is added to API document.
[ ] Not Applicable (i.e., the PR does not include API changes.)
Test quickstart & API (bash tests/docs/test_changed_document.sh)
If any APIs in quickstarts or tutorials are modified, please run this test to check if the current examples can run correctly after the modified APIs are released.
What does this PR do?
Change from a higher level interface to a lower one because:
concurrent.futures.ProcessPoolExecutor
pool, there is probability that deadlocks will occur with CPython. The same example run withmultiprocessing.pool.Pool
had no such problem.Trying to find out the best num_processes
I tested tokenization with various num_processes (using RegexTokenization):
Based on the results, I believe 16 is a reasonable choice for num_processes. For small datasets, a difference of 1 to 2 seconds is negligible. For large datasets like AmazonCat-13K, 16 has the least running time than other settings.
Having said that, the results are device- and system-specific. This means the choice for num_processes might be different, for example, on intel CPU or on Windows (I'm using AMD server CPU and Linux).
I tested multiprocessing on Windows. Since on Windows and MacOS doesn't has "fork" as start method, the running is longer using "spawn" as start method (spawn takes more time to start than fork).
win32, spawn
I also tested spawn on linux
It turned out the support for multiprocessing is more complicated than I think. So I'll limited the use of multiprocessing on Linux only.
Test CLI & API (
bash tests/autotest.sh
)Test APIs used by main.py.
Check API Document
If any new APIs are added, please check if the description of the APIs is added to API document.
Test quickstart & API (
bash tests/docs/test_changed_document.sh
)If any APIs in quickstarts or tutorials are modified, please run this test to check if the current examples can run correctly after the modified APIs are released.