aneesha opened this issue 3 years ago
Hi! Lemmatization is definitely the biggest bottleneck in preprocessing. I didn't know about spaCy pipes. They seem like the right solution for us, since we already rely on spaCy for lemmatization.
If you want to contribute, feel free to open a pull request :) Thanks,
Silvia
Thanks - I'll work on this and submit a pull request.
Thank you! Let me know if you have any questions.
Silvia
How are we supposed to generate this vocabulary.tsx file in order to use the `dataset = preprocessor.preprocess_dataset(documents_path=r'..\corpus.txt', labels_path=r'..\labels.txt')` method for preprocessing?
Preprocessing currently takes a long time for large datasets. One way to improve the speed is to use spaCy pipes, particularly for lemmatization. The preprocessing class is very useful and can do a lot with just a simple argument configuration.
I'm happy to contribute code to make this change.