ASUS-AICS / LibMultiLabel

A library for multi-class and multi-label classification
MIT License

Speed Up Tokenization Through Multiprocessing #347

Closed donglihe-hub closed 10 months ago

donglihe-hub commented 10 months ago

What does this PR do?

For large datasets, tokenization can take a long time (40 minutes for AmazonCat-13K using nltk.word_tokenize). Since instances are independent of each other during tokenization, multiprocessing can speed up the process.
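To illustrate the idea of tokenizing independent instances in parallel, here is a minimal sketch using only the Python standard library. It is not the code from this PR: `tokenize` is a hypothetical regex-based stand-in for nltk.word_tokenize, and `tokenize_all`, `processes`, and `chunksize` are names chosen for the example.

```python
import re
from multiprocessing import Pool

# Hypothetical stand-in for nltk.word_tokenize: words and single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Lowercase a string and split it into word and punctuation tokens."""
    return TOKEN_RE.findall(text.lower())


def tokenize_all(texts, processes=None, chunksize=256):
    """Tokenize every text, spreading the work over worker processes.

    Each text is tokenized independently, so the work can be split freely.
    Pool.map returns results in the same order as the input.
    """
    if processes == 1:
        # Skip the process pool entirely when only one worker is requested.
        return [tokenize(text) for text in texts]
    with Pool(processes=processes) as pool:
        # A larger chunksize reduces inter-process communication overhead.
        return pool.map(tokenize, texts, chunksize=chunksize)


if __name__ == "__main__":
    docs = ["Fast tokenization, please!", "Each instance is independent."]
    print(tokenize_all(docs, processes=2))
```

The worker function is defined at module level so it can be pickled and sent to the worker processes, and the `if __name__ == "__main__":` guard keeps the example safe on platforms that start workers by re-importing the main module.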

Currently, LibMultiLabel does not print any information during tokenization, so users who provide a large dataset may assume the program is stuck. I therefore also added tqdm to tokenization, which lets users see what LibMultiLabel is doing.
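As a rough sketch of what progress reporting during tokenization can look like (again, not the code from this PR), the loop over instances can be wrapped in tqdm. `tokenize` and `tokenize_with_progress` are hypothetical names, and the regex tokenizer is a stand-in for nltk.word_tokenize. The fallback import is only there so the example still runs where tqdm is not installed.

```python
import re

try:
    from tqdm import tqdm
except ImportError:
    # Fallback so the sketch runs without tqdm: a wrapper that shows nothing.
    def tqdm(iterable, **kwargs):
        return iterable


def tokenize(text):
    """Hypothetical stand-in tokenizer: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def tokenize_with_progress(texts):
    """Tokenize each text while showing a progress bar on stderr."""
    return [tokenize(text) for text in tqdm(texts, desc="Tokenizing", total=len(texts))]


if __name__ == "__main__":
    print(tokenize_with_progress(["A large dataset.", "Still working, not stuck!"]))
```

With a process pool, one common pattern is to wrap a lazy iterator such as `pool.imap(...)` in tqdm and pass `total=len(texts)`, so the bar advances as results arrive.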

Test CLI & API (bash tests/autotest.sh)

Test APIs used by main.py.

Check API Document

If any new APIs are added, please check that their descriptions have been added to the API document.

Test quickstart & API (bash tests/docs/test_changed_document.sh)

If any APIs in quickstarts or tutorials are modified, please run this test to check if the current examples can run correctly after the modified APIs are released.

Gordon119 commented 10 months ago

Looks good to me. How about @Eleven1Liu?

Eleven1Liu commented 10 months ago

Good