NorskRegnesentral / skweak

skweak: A software toolkit for weak supervision applied to NLP tasks
MIT License
918 stars 73 forks

Speeding up Document Annotation #53

Closed · azucker99 closed this issue 2 years ago

azucker99 commented 2 years ago

Hello, how would you recommend speeding up the time it takes to apply all the annotators to a corpus so that it scales to larger corpora (i.e., >10,000 documents)? I'm following the convention for defining a combined annotator shown in your conll2003_ner example, but it has proven slow even with relatively fast labelling functions. I attempted a first pass at running the combined annotator in parallel, but without any luck. Any suggestions for parallelizing the annotation of the corpus, or any other methods for scaling annotation to a larger document collection, would be appreciated!

Thank you so much for this awesome library!

plison commented 2 years ago

I guess the bottleneck here is spaCy (i.e. converting the raw documents into Doc objects), since the labelling functions themselves are usually quite fast. The easiest thing to do is simply to split the document collection into several parts, save them into separate docbin files (using the from_docbin and to_docbin functions in utils), and then run your pipeline on each file in a separate process, which will effectively parallelize the processing.
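The split-and-parallelize pattern described above can be sketched roughly as follows. This is only an illustration: `chunk` and `annotate` are hypothetical helper names, and the body of `annotate` is a placeholder for what a real worker would do, namely load its chunk of Doc objects from a docbin file, apply the combined annotator, and write the annotated docs back out.

```python
# Sketch of splitting a corpus into chunks and annotating them in
# parallel worker processes. The annotate() body is a stand-in for a
# real skweak/spaCy pipeline applied to one chunk.
from multiprocessing import Pool


def chunk(items, n_chunks):
    """Split a list into at most n_chunks roughly equal parts."""
    size = -(-len(items) // n_chunks)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def annotate(texts):
    # Placeholder: a real worker would convert texts to Doc objects
    # (or read a pre-built docbin file) and run the labelling functions.
    return [t.upper() for t in texts]


if __name__ == "__main__":
    corpus = [f"document {i}" for i in range(10_000)]
    parts = chunk(corpus, 4)
    with Pool(processes=4) as pool:
        results = pool.map(annotate, parts)
    # Flatten the per-chunk results back into one annotated corpus.
    annotated = [doc for part in results for doc in part]
```

Note that spaCy itself can also parallelize the text-to-Doc conversion via `nlp.pipe(texts, n_process=...)`, which may be worth trying before splitting files manually, since that conversion is the likely bottleneck here.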

azucker99 commented 2 years ago

This worked, thank you! I'll close out this issue.