ASUS-AICS / LibMultiLabel

A library for multi-class and multi-label classification
MIT License
152 stars 30 forks source link

Change from concurrent.futures to multiprocessing #354

Open donglihe-hub opened 9 months ago

donglihe-hub commented 9 months ago

What does this PR do?

Change from a higher level interface to a lower one because:

  1. As described in https://github.com/python/cpython/issues/105829, when there are many tasks submitted to a concurrent.futures.ProcessPoolExecutor pool, there is probability that deadlocks will occur with CPython. The same example run with multiprocessing.pool.Pool had no such problem.
  2. The issue has been fixed in Python>=3.11.6 and >=1.12.1. For the sake of compatibility, I rewrote the parallel tokenization using multiprocessing.

Trying to find out the best num_processes

I tested tokenization with various num_processes (using RegexTokenization):

linux, fork num_processes AmazonCat-13K EUR-Lex Wiki10-31K
no parallel 108.49 s 6.66 s 10.31 s
2 92.31 s 5.81 s 9.92 s
4 61.62 s 3.80 s 6.01 s
8 59.44 s 3.20 s 4.65 s
16 49.05 s 3.90 s 5.72 s
32 52.18 s 3.44 s 4.96 s
64 57.61 s 5.37 s 7.09 s
128 86.96 s 8.54 s 11.16 s
256 119.51 s 14.57 s 19.10 s
Adds-on: I re-ran the codes again. This time 16 had the best perfomance in all cases. num_processes AmazonCat-13K EUR-Lex Wiki10-31K
no parallel 97.36 s 6.50 s 10.57 s
8 62.09 s 4.09 s 6.29 s
16 42.34 s 3.59 s 5.67 s
32 69.60 s 4.72 s 6.70 s

Based on the results, I believe 16 is a reasonable choice for num_processes. For small datasets, a difference of 1 to 2 seconds is negligible. For large datasets like AmazonCat-13K, 16 has the least running time than other settings.

Having said that, the results are device- and system-specific. This means the choice for num_processes might be different, for example, on intel CPU or on Windows (I'm using AMD server CPU and Linux).

I tested multiprocessing on Windows. Since on Windows and MacOS doesn't has "fork" as start method, the running is longer using "spawn" as start method (spawn takes more time to start than fork).

win32, spawn

num_processes EUR-Lex Wiki10-31K
no parallel 6.02 s 15.61 s
2 14.85 s 23.68 s
4 17.57 s 19.42 s
8 20.08 s 23.44 s
12 26.21 s 35.86 s

I also tested spawn on linux

linux, spawn num_processes EUR-Lex Wiki10-31K
no parallel 6.61 s 10.64 s
2 8.41 s 12.67 s
4 6.07 s 8.60 s
8 5.88 s 8.65 s
12 5.36 s 6.81 s
16 5.42 s 6.85 s
32 6.36 s 8.86 s

It turned out the support for multiprocessing is more complicated than I think. So I'll limited the use of multiprocessing on Linux only.

Test CLI & API (bash tests/autotest.sh)

Test APIs used by main.py.

Check API Document

If any new APIs are added, please check if the description of the APIs is added to API document.

Test quickstart & API (bash tests/docs/test_changed_document.sh)

If any APIs in quickstarts or tutorials are modified, please run this test to check if the current examples can run correctly after the modified APIs are released.