CogComp / cogcomp-nlp

CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.
http://nlp.cogcomp.org/

SpanLabelView.addConstituent inefficient #481

Open cowchipkid opened 7 years ago

cowchipkid commented 7 years ago

Each time we add a constituent to the SpanLabelView, we sort the constituents after appending the new one. This appears to be massively inefficient for very large files. When creating the text annotation from tokenized text, this sort gets done over and over, once per constituent, when in fact it would be much more efficient if it were done only once.

cowchipkid commented 7 years ago

I am reprocessing my corpus of documents, and I am seeing cases where the tokenizer takes hours. Huge documents (2M), with half a million tokens each, are just choking the system. The tokenizer itself runs in order N, but this sorting in the addConstituent method is a big problem. The fix is easy: take the sort out and make it a separate method. But what does that break?
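
A minimal sketch of that idea, to make the trade-off concrete. This is not the actual SpanLabelView code: `DeferredSortView`, `Span`, `addAndSort`, `add`, `ensureSorted`, and `getSpans` are all hypothetical names chosen for illustration. The eager variant sorts on every append, so N appends trigger N sorts; even when each sort is close to linear on nearly sorted input, the total grows roughly quadratically. The deferred variant only appends and marks the list dirty, then sorts once on the first read.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a span-based view; NOT the real SpanLabelView.
class DeferredSortView {

    // A labeled span over token offsets [start, end).
    static final class Span {
        final int start;
        final int end;
        final String label;

        Span(int start, int end, String label) {
            this.start = start;
            this.end = end;
            this.label = label;
        }
    }

    // Order spans by start offset, breaking ties by end offset.
    private static final Comparator<Span> BY_OFFSET =
            Comparator.<Span>comparingInt(s -> s.start).thenComparingInt(s -> s.end);

    private final List<Span> spans = new ArrayList<>();
    private boolean sorted = true; // an empty list is trivially sorted

    // Eager variant: sort after every append, as described in this issue.
    void addAndSort(Span s) {
        spans.add(s);
        spans.sort(BY_OFFSET);
        sorted = true;
    }

    // Deferred variant: append only and remember that a sort is pending.
    void add(Span s) {
        spans.add(s);
        sorted = false;
    }

    // Sort at most once per batch of appends.
    void ensureSorted() {
        if (!sorted) {
            spans.sort(BY_OFFSET);
            sorted = true;
        }
    }

    // Readers always see sorted data, so callers that interleave
    // adds and reads keep working; they just pay for a sort on read.
    List<Span> getSpans() {
        ensureSorted();
        return Collections.unmodifiableList(new ArrayList<>(spans));
    }
}
```

As for what this might break: any code that reads the underlying list directly between appends, without going through something like `ensureSorted()`, would see unsorted data. Sorting lazily on read, as sketched here, is one way to keep such callers safe, assuming every read path can be routed through a single accessor.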