What does this PR do?
For large datasets, tokenization can take a long time (about 40 minutes for AmazonCat-13K with nltk.word_tokenize). Since instances are independent of each other during tokenization, multiprocessing can speed up the process.
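The idea can be illustrated with a minimal sketch (not the actual LibMultiLabel code; the function name and worker count are illustrative): each raw text is tokenized independently in a pool of worker processes.

```python
from multiprocessing import Pool

from nltk.tokenize import word_tokenize  # requires nltk.download("punkt")


def tokenize_texts(texts, num_workers=4):
    """Tokenize each text independently in a pool of worker processes."""
    with Pool(processes=num_workers) as pool:
        # Each instance is independent, so the work can be split across processes.
        return pool.map(word_tokenize, texts)


# Hypothetical usage with a list of raw documents:
# tokenized = tokenize_texts(["first document ...", "second document ..."])
```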
Currently, LibMultiLabel does not produce any output during tokenization, so users may assume the program is stuck when they provide a large dataset. As an extra change, this PR adds a tqdm progress bar to tokenization, which lets users see what LibMultiLabel is doing.
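A sketch of how the progress reporting can be combined with the multiprocessing above (again illustrative, not the exact implementation): wrapping the pool's lazy imap iterator with tqdm updates the bar as each instance finishes.

```python
from multiprocessing import Pool

from nltk.tokenize import word_tokenize
from tqdm import tqdm


def tokenize_texts_with_progress(texts, num_workers=4, chunksize=100):
    """Tokenize in parallel while showing progress for each finished instance."""
    with Pool(processes=num_workers) as pool:
        # imap yields results lazily, so tqdm can advance as results arrive.
        results = pool.imap(word_tokenize, texts, chunksize=chunksize)
        return list(tqdm(results, total=len(texts), desc="Tokenizing"))
```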
Test CLI & API (bash tests/autotest.sh)
Test APIs used by main.py.
[ ] Test Pass
(Copy and paste the last outputted line here.)
[x] Not Applicable (i.e., the PR does not include API changes.)
Check API Document
If any new APIs are added, please check if the description of the APIs is added to the API document.
[x] Not Applicable (i.e., the PR does not include API changes.)
Test quickstart & API (bash tests/docs/test_changed_document.sh)
If any APIs in quickstarts or tutorials are modified, please run this test to check if the current examples can run correctly after the modified APIs are released.