Closed azucker99 closed 2 years ago
I guess the bottleneck here is spacy (i.e. converting the raw documents into `Doc` objects), since the labelling functions themselves are usually quite fast. The easiest thing to do is simply to split the document collection into several parts, save them into separate DocBin files (using the `from_docbin` and `to_docbin` functions in `utils`), and then run your pipeline on each file, which will effectively parallelize the processing.
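A rough sketch of this split-and-parallelize approach, using spacy's own `DocBin` class directly and Python's `multiprocessing` module. The `chunk` helper, the file paths, and the commented-out `annotator.pipe(...)` call (standing in for your skweak labelling pipeline) are all illustrative, not part of the library's API:

```python
# Sketch: tokenize the corpus once, split it into several DocBin files,
# then annotate each file in its own worker process.
import multiprocessing as mp

import spacy
from spacy.tokens import DocBin


def chunk(items, n_parts):
    """Split a list into n_parts roughly equal slices."""
    size = -(-len(items) // n_parts)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def save_parts(texts, n_parts):
    """Tokenize raw texts and write them to one DocBin file per part."""
    nlp = spacy.blank("en")
    paths = []
    for i, part in enumerate(chunk(list(nlp.pipe(texts)), n_parts)):
        path = f"corpus_part_{i}.spacy"  # illustrative file name
        DocBin(docs=part, store_user_data=True).to_disk(path)
        paths.append(path)
    return paths


def annotate_file(path):
    """Load one DocBin file, run the labelling pipeline, and save it back."""
    nlp = spacy.blank("en")
    docs = list(DocBin().from_disk(path).get_docs(nlp.vocab))
    # docs = list(annotator.pipe(docs))  # apply your skweak annotator here
    DocBin(docs=docs, store_user_data=True).to_disk(path)


if __name__ == "__main__":
    paths = save_parts(["First document.", "Second document.", "Third."], 2)
    with mp.Pool() as pool:
        pool.map(annotate_file, paths)
```

Since each file is annotated in a separate process, the speedup should scale roughly with the number of cores, as long as the parts are of similar size.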
This worked, thank you! I'll close out this issue.
Hello, how would you recommend speeding up the time it takes to apply all the annotators to the corpus, so that it can scale to larger corpora (i.e., >10,000 documents)? I'm following the convention for defining a combined annotator shown in your conll2003_ner example, but it has proven to be slow even with relatively fast labelling functions. I attempted a first pass at parallelizing the combined annotator, but I didn't have any luck. Any suggestions for implementing parallel annotation of the corpus, or any other methods for scaling the annotation to a larger collection of documents, would be appreciated!
Thank you so much for this awesome library!