NTMC-Community / MatchZoo

Facilitating the design, comparison and sharing of deep text matching models.

Speed up preprocess with multiprocessing #787

Open matthew-z opened 5 years ago

matthew-z commented 5 years ago

Is your feature request related to a problem? Please describe.

The current preprocessor only utilises one CPU core, so it is quite slow on large datasets (e.g., it takes 40 minutes to preprocess the robust04 dataset).

Describe the solution you'd like

Add a multiprocessing version of apply_on_text_*, e.g.: https://stackoverflow.com/a/53135031

Drawbacks: while it might be easy to apply multiprocessing to NLTK tokenizers, it is a little tricky to do so for spaCy.
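
One way this could look (a minimal sketch, not MatchZoo's actual API): split the underlying pandas Series into chunks and fan the chunks out over a `multiprocessing.Pool`, along the lines of the Stack Overflow answer above. `parallel_apply` and `_apply_chunk` are hypothetical helper names, and the function passed in must be picklable (a module-level function such as NLTK's `word_tokenize` works; a lambda does not).

```python
import multiprocessing as mp

import numpy as np
import pandas as pd


def _apply_chunk(args):
    # Module-level so the pool can pickle it.
    chunk, func = args
    return chunk.apply(func)


def parallel_apply(series, func, num_workers=None):
    """Apply `func` element-wise to a Series across worker processes."""
    num_workers = num_workers or mp.cpu_count()
    chunks = np.array_split(series, num_workers)
    with mp.Pool(num_workers) as pool:
        results = pool.map(_apply_chunk, [(chunk, func) for chunk in chunks])
    return pd.concat(results)


if __name__ == "__main__":
    from nltk.tokenize import word_tokenize
    texts = pd.Series(["hello world", "speed up preprocessing"] * 4)
    print(parallel_apply(texts, word_tokenize))
```

For spaCy, a common workaround in recent versions is to let spaCy manage the workers itself via `nlp.pipe(texts, n_process=...)` rather than pickling `Doc` objects across processes.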

matthew-z commented 5 years ago

BTW, for large datasets, memory may become the bottleneck. For example, preprocessing 300,000 documents with mz.preprocessors.DSSMPreprocessor requires more than 80 GB. I wonder if there is any workaround?
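
A rough back-of-envelope calculation (with assumed numbers) suggests why: DSSM-style preprocessing expands each document into a very wide tri-letter vector, so a dense representation alone accounts for tens of gigabytes. The 30,000-dimension figure below is an assumption about the letter-trigram vocabulary size, not a value taken from MatchZoo.

```python
# Back-of-envelope estimate with assumed numbers.
num_docs = 300_000
triletter_dim = 30_000    # assumption: typical letter-trigram vocabulary size
bytes_per_float = 4       # float32

total_gb = num_docs * triletter_dim * bytes_per_float / 1024 ** 3
print(f"~{total_gb:.0f} GB for one dense copy")  # ~34 GB; a few copies exceed 80 GB
```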

uduse commented 5 years ago

Since mz.DataPack uses pandas.DataFrame to store data, it is possible to use Dask to store the data instead. This was once on our schedule, but I don't think it will be supported, given the lack of manpower.

(Also, you may use an external database to work around the problem.)
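
A minimal sketch of what the Dask route could look like, assuming the DataPack's frame were swapped for a `dask.dataframe` (the `text_left` column name mirrors DataPack's layout; the rest is illustrative):

```python
import dask.dataframe as dd
import pandas as pd

# Illustrative stand-in for a DataPack frame.
pdf = pd.DataFrame({"text_left": ["some document text"] * 1_000})
ddf = dd.from_pandas(pdf, npartitions=8)

# Operations build a lazy task graph; partitions are processed one at a
# time when .compute() runs, so peak memory stays bounded per partition.
lengths = ddf["text_left"].str.len().compute()
print(lengths.head())
```

The same partition-wise pattern would also spread tokenization across cores, which ties in with the multiprocessing request above.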

paciops commented 4 years ago

Hello, I have some free time in the next few weeks and I would like to implement a multiprocessing version of apply_on_text_* and of the preprocessors (starting with BasicPreprocessor). Would that be helpful?

uduse commented 4 years ago

@paciops Hi, this project is no longer actively maintained. However, if you would like to contribute, I can add you as a collaborator so you can submit PRs and merge your code much more easily.

paciops commented 4 years ago

Yes, I would like that.