Change from concurrent.futures to multiprocessing

What does this PR do?

Change from a higher-level interface to a lower-level one because:

As described in https://github.com/python/cpython/issues/105829, when many tasks are submitted to a concurrent.futures.ProcessPoolExecutor pool, CPython can deadlock. The same example run with multiprocessing.pool.Pool had no such problem.
The issue has been fixed in Python >=3.11.6 and >=3.12.1. For the sake of compatibility with older versions, I rewrote the parallel tokenization using multiprocessing.
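
For reference, a minimal sketch of what pool-based tokenization with multiprocessing can look like. This is not the code in this PR: `TOKEN_PATTERN`, `tokenize`, `tokenize_all` and the `chunksize` default are hypothetical names and values chosen for illustration, and the regex is only a stand-in for the real tokenization rule.

```python
import multiprocessing
import re

# Hypothetical token pattern; the tokenization rule used in the PR is not shown here.
TOKEN_PATTERN = re.compile(r"\w+")


def tokenize(text):
    """Lowercase one document and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def tokenize_all(texts, num_processes=1, chunksize=64):
    """Tokenize every document, using a process pool when num_processes > 1."""
    if num_processes <= 1:
        # Serial fallback: no worker processes are started.
        return [tokenize(text) for text in texts]
    # Pool.map keeps the input order and blocks until every result is ready.
    with multiprocessing.Pool(processes=num_processes) as pool:
        return pool.map(tokenize, texts, chunksize=chunksize)
```

With the "spawn" start method, the parallel call should be made from under an `if __name__ == "__main__":` guard, because worker processes re-import the main module.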

Trying to find out the best num_processes

I tested tokenization with various num_processes (using RegexTokenization):

linux, fork

num_processes	AmazonCat-13K	EUR-Lex	Wiki10-31K
no parallel	108.49 s	6.66 s	10.31 s
2	92.31 s	5.81 s	9.92 s
4	61.62 s	3.80 s	6.01 s
8	59.44 s	3.20 s	4.65 s
16	49.05 s	3.90 s	5.72 s
32	52.18 s	3.44 s	4.96 s
64	57.61 s	5.37 s	7.09 s
128	86.96 s	8.54 s	11.16 s
256	119.51 s	14.57 s	19.10 s

Add-on: I re-ran the code. This time 16 had the best performance in all cases.

num_processes	AmazonCat-13K	EUR-Lex	Wiki10-31K
no parallel	97.36 s	6.50 s	10.57 s
8	62.09 s	4.09 s	6.29 s
16	42.34 s	3.59 s	5.67 s
32	69.60 s	4.72 s	6.70 s

Based on the results, I believe 16 is a reasonable choice for num_processes. For small datasets, a difference of 1 to 2 seconds is negligible. For large datasets like AmazonCat-13K, 16 has the shortest running time of all the settings.
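
As a quick check of that choice, the snippet below picks the fastest setting per dataset from the re-run timings above and prints the speedup over the non-parallel run. The numbers are copied from the re-run table; `rerun_seconds` and `fastest_setting` are names made up for this sketch.

```python
# Re-run timings in seconds, copied from the re-run table above (linux, fork).
rerun_seconds = {
    "AmazonCat-13K": {"no parallel": 97.36, 8: 62.09, 16: 42.34, 32: 69.60},
    "EUR-Lex": {"no parallel": 6.50, 8: 4.09, 16: 3.59, 32: 4.72},
    "Wiki10-31K": {"no parallel": 10.57, 8: 6.29, 16: 5.67, 32: 6.70},
}


def fastest_setting(timings):
    """Return the setting (dict key) with the smallest running time."""
    return min(timings, key=timings.get)


for dataset, timings in rerun_seconds.items():
    best = fastest_setting(timings)
    speedup = timings["no parallel"] / timings[best]
    print(f"{dataset}: best={best}, speedup={speedup:.2f}x")
```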

Having said that, the results are device- and system-specific. The best choice for num_processes might be different on, for example, an Intel CPU or on Windows (I'm using an AMD server CPU and Linux).
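
Because the numbers depend on the machine, it may be worth re-measuring locally. Below is a rough timing harness, not the script used for the tables above. `time_settings` is a hypothetical helper, and `run` is assumed to be any callable that performs the tokenization for a given num_processes.

```python
import time


def time_settings(run, settings, repeats=3):
    """Return the best-of-`repeats` wall-clock time in seconds for each setting."""
    results = {}
    for setting in settings:
        elapsed = []
        for _ in range(repeats):
            start = time.perf_counter()
            run(setting)  # e.g. tokenize the dataset with `setting` processes
            elapsed.append(time.perf_counter() - start)
        # Taking the minimum reduces noise from other load on the machine.
        results[setting] = min(elapsed)
    return results
```

Taking the minimum over a few repeats is one common convention; the tables above report single runs, so the two are not directly comparable.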

I tested multiprocessing on Windows. Since Windows does not support "fork" as a start method (and macOS defaults to "spawn"), the run takes longer with "spawn" as the start method (spawn takes more time to start than fork).

win32, spawn

num_processes	EUR-Lex	Wiki10-31K
no parallel	6.02 s	15.61 s
2	14.85 s	23.68 s
4	17.57 s	19.42 s
8	20.08 s	23.44 s
12	26.21 s	35.86 s

I also tested spawn on linux

linux, spawn

num_processes	EUR-Lex	Wiki10-31K
no parallel	6.61 s	10.64 s
2	8.41 s	12.67 s
4	6.07 s	8.60 s
8	5.88 s	8.65 s
12	5.36 s	6.81 s
16	5.42 s	6.85 s
32	6.36 s	8.86 s

It turned out that supporting multiprocessing is more complicated than I thought, so I'll limit the use of multiprocessing to Linux only.
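
One way such a Linux-only restriction could be expressed is sketched below. This is an assumption about the shape of the check, not the code in this PR; `effective_num_processes` is a hypothetical helper name.

```python
import sys


def effective_num_processes(requested, platform=None):
    """Return `requested` (at least 1) on Linux and 1 (no parallelism) elsewhere."""
    # `platform` is injectable so the behaviour can be checked without switching OS.
    platform = sys.platform if platform is None else platform
    if platform.startswith("linux"):
        return max(1, requested)
    # On Windows ("win32") and macOS ("darwin"), fall back to a single process.
    return 1
```

For example, `effective_num_processes(16, "win32")` returns 1, so the Windows timings above would fall back to the non-parallel path.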

Test CLI & API (`bash tests/autotest.sh`)

Test APIs used by main.py.

[ ] Test Pass
- (Copy and paste the last outputted line here.)
[ ] Not Applicable (i.e., the PR does not include API changes.)

Check API Document

If any new APIs are added, please check that their descriptions have been added to the API document.

[ ] API document is updated (linear, nn)
[ ] Not Applicable (i.e., the PR does not include API changes.)

Test quickstart & API (`bash tests/docs/test_changed_document.sh`)

If any APIs in quickstarts or tutorials are modified, please run this test to check if the current examples can run correctly after the modified APIs are released.

ASUS-AICS / LibMultiLabel