What does this PR do?
For large datasets, tokenization can take a long time (about 40 minutes for AmazonCat-13K with nltk.word_tokenize). Since instances are independent of each other during tokenization, multiprocessing can speed up the process.
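The idea can be illustrated with a minimal sketch (not the actual LibMultiLabel code; the function name and worker count are illustrative): each raw text is tokenized independently in a pool of worker processes.

```python
from multiprocessing import Pool

from nltk.tokenize import word_tokenize  # requires nltk.download("punkt")


def tokenize_texts(texts, num_workers=4):
    """Tokenize each text independently in a pool of worker processes."""
    with Pool(processes=num_workers) as pool:
        # Each instance is independent, so the work can be split across processes.
        return pool.map(word_tokenize, texts)


# Hypothetical usage with a list of raw documents:
# tokenized = tokenize_texts(["first document ...", "second document ..."])
```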
Currently, LibMultiLabel does not produce any output during tokenization, so users may assume the program is stuck when they provide a large dataset. As an extra change, this PR adds a tqdm progress bar to tokenization, which lets users see what LibMultiLabel is doing.
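A sketch of how the progress reporting can be combined with the multiprocessing above (again illustrative, not the exact implementation): wrapping the pool's lazy imap iterator with tqdm updates the bar as each instance finishes.

```python
from multiprocessing import Pool

from nltk.tokenize import word_tokenize
from tqdm import tqdm


def tokenize_texts_with_progress(texts, num_workers=4, chunksize=100):
    """Tokenize in parallel while showing progress for each finished instance."""
    with Pool(processes=num_workers) as pool:
        # imap yields results lazily, so tqdm can advance as results arrive.
        results = pool.imap(word_tokenize, texts, chunksize=chunksize)
        return list(tqdm(results, total=len(texts), desc="Tokenizing"))
```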
Test CLI & API (bash tests/autotest.sh)
Test APIs used by main.py.
[ ] Test Pass
(Copy and paste the last outputted line here.)
[x] Not Applicable (i.e., the PR does not include API changes.)
Check API Document
If any new APIs are added, please check if the description of the APIs is added to the API document.
[x] Not Applicable (i.e., the PR does not include API changes.)
Test quickstart & API (bash tests/docs/test_changed_document.sh)
If any APIs in quickstarts or tutorials are modified, please run this test to check if the current examples can run correctly after the modified APIs are released.