Open matthew-z opened 5 years ago
BTW, for large datasets, memory may become the bottleneck. For example, I am preprocessing 300,000 documents and it requires more than 80 GB of memory using mz.preprocessors.DSSMPreprocessor. I wonder if there is any workaround?
Since mz.DataPack uses pandas.DataFrame to store data, it would be possible to use Dask to store the data instead. This was once on our schedule, but I don't think it will be supported, due to a lack of manpower.
(Also, you may use an external database to work around the problem.)
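To illustrate the "external database" workaround mentioned above, here is a minimal sketch using Python's built-in sqlite3: documents live on disk and are streamed row by row, so the full corpus never has to sit in a pandas DataFrame at once. The table layout and the preprocess function are hypothetical; MatchZoo's DataPack does not support this out of the box.

```python
# Sketch: keep raw documents in SQLite and stream them during
# preprocessing instead of materialising the whole corpus in memory.
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a real corpus
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO docs (text) VALUES (?)",
    [("hello world",), ("foo bar baz",)],
)

def preprocess(text):
    # Stand-in for a real tokenizer/preprocessing pipeline.
    return text.split()

# Iterate over the cursor: rows are fetched lazily, one batch at a time.
processed = [preprocess(row[0]) for row in conn.execute("SELECT text FROM docs")]
print(processed)
```

With a file-backed database, only the rows currently being processed occupy RAM, which trades memory for disk I/O.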
Hello, I have some free time in the next few weeks and I would like to implement a multiprocessing version of apply_on_text_* and of the preprocessors (starting with BasicPreprocessor). Could that be helpful?
@paciops Hi, this project is no longer actively maintained. However, if you would like to contribute, I can add you as a collaborator so you can submit PRs and merge your code more easily.
Yes, I would like that.
Is your feature request related to a problem? Please describe.
The current preprocessor only utilises one CPU core, so it is quite slow on large datasets (e.g., it takes 40 minutes to preprocess the robust04 dataset).

Describe the solution you'd like
Add a multiprocessing version of apply_on_text_*. E.g.: https://stackoverflow.com/a/53135031

Drawbacks: while it might be easy to apply multiprocessing to NLTK tokenizers, it is a little bit trickier to do so for spaCy.