MIND-Lab / OCTIS

OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)
MIT License

Improve Preprocessing Speed #27

Open aneesha opened 3 years ago

aneesha commented 3 years ago

Preprocessing currently takes a long time for large datasets. One way to improve the speed is to use spaCy pipes, particularly for lemmatization. Preprocessing is a very useful class that can do a lot with just simple argument configuration.

import string
import spacy

# spaCy 2.x: pronoun lemmas come back as the placeholder "-PRON-"
spacy_nlp = spacy.load("en_core_web_sm")
stop_words = spacy_nlp.Defaults.stop_words
punctuations = set(string.punctuation)
processed_documents = []

# `documents` is the list of raw input strings
for doc in spacy_nlp.pipe(documents, batch_size=32, n_process=3, disable=["parser", "ner"]):
    # Lemmatize each token and convert to lower case if the token is not a pronoun
    tokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in doc]
    # Remove stop words and punctuation
    tokens = [word for word in tokens if word not in stop_words and word not in punctuations]
    processed_documents.append(tokens)

I'm happy to contribute code to make this change.
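(For illustration of why piping helps: the speed-up comes from feeding the model batches of documents instead of calling it once per document. The chunking idea behind `pipe(batch_size=...)` can be sketched in plain Python, independent of spaCy; `batched` here is an illustrative helper, not a spaCy API.)

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield successive lists of up to batch_size items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

docs = [f"doc {i}" for i in range(7)]
batches = list(batched(docs, 3))
# → [['doc 0', 'doc 1', 'doc 2'], ['doc 3', 'doc 4', 'doc 5'], ['doc 6']]
```

Processing each batch in one model call (and spreading batches across `n_process` workers) amortizes per-call overhead, which is where the preprocessing time goes for large corpora.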

silviatti commented 3 years ago

Hi! Lemmatization is definitely the biggest bottleneck in preprocessing. I didn't know about spaCy pipes. It seems like the right solution for us, since we already rely on spaCy for lemmatization.

If you want to contribute, feel free to open a pull request :) Thanks,

Silvia

aneesha commented 3 years ago

Thanks - I'll work on this and submit a pull request.

silviatti commented 3 years ago

Thank you! Let me know if you have any questions.

Silvia

SaraAmd commented 1 year ago

How are we supposed to generate this vocabulary.tsx file in order to use the dataset = preprocessor.preprocess_dataset(documents_path=r'..\corpus.txt', labels_path=r'..\labels.txt') method for preprocessing?
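(As I understand it, the vocabulary file is an output of the preprocessing step rather than an input you supply, and its format is simply plain text with one token per line. A minimal sketch of building such a file from already-tokenized documents; `build_vocabulary` is an illustrative helper, not part of OCTIS.)

```python
from collections import Counter

def build_vocabulary(tokenized_docs, min_count=1):
    """Return sorted tokens that appear at least min_count times across all docs."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return sorted(tok for tok, c in counts.items() if c >= min_count)

docs = [["topic", "model", "evaluation"], ["topic", "model"]]
vocab = build_vocabulary(docs, min_count=2)
# → ['model', 'topic']

# The vocabulary file is then just one token per line:
vocab_file_contents = "\n".join(vocab) + "\n"
```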