Open cowchipkid opened 7 years ago
I am reprocessing my corpus of documents, and I am seeing cases where the tokenizer takes hours. Huge documents (2 MB), with half a million tokens each, are just choking the system. The tokenizer itself runs in O(n), but the sorting in the addConstituent method is a big problem. The fix is easy: take the sort out and make it a separate method. But what does that break?
Each time we add a constituent to the SpanLabelView, we sort the results after appending the new constituent. This appears to be massively inefficient for very large files. When creating the text annotation from tokenized text, this sort gets done over and over and over, when in fact it would be much more efficient if it were done only once.
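To illustrate the point, here is a minimal sketch (not the actual SpanLabelView code; the class and method names below are simplified stand-ins). Sorting inside every add makes building a view of n constituents cost O(n² log n) total, whereas appending and then sorting once at the end is O(n log n):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortOnAddDemo {
    // Minimal stand-in for a constituent: just a start offset.
    static class Span implements Comparable<Span> {
        final int start;
        Span(int start) { this.start = start; }
        public int compareTo(Span other) { return Integer.compare(start, other.start); }
    }

    static class View {
        final List<Span> constituents = new ArrayList<>();

        // Current behavior as described above: sort after every append.
        // Over n adds this sorts n times, i.e. O(n^2 log n) in total.
        void addConstituentSorted(Span s) {
            constituents.add(s);
            Collections.sort(constituents);
        }

        // Proposed fix: append only, then sort once when the view is complete.
        void addConstituent(Span s) { constituents.add(s); }
        void sortConstituents() { Collections.sort(constituents); }
    }

    public static void main(String[] args) {
        View view = new View();
        int[] starts = {5, 1, 3};
        for (int start : starts) {
            view.addConstituent(new Span(start));
        }
        view.sortConstituents(); // one O(n log n) sort instead of n of them
        StringBuilder sb = new StringBuilder();
        for (Span s : view.constituents) {
            sb.append(s.start).append(' ');
        }
        System.out.println(sb.toString().trim());
    }
}
```

The risk with deferring the sort is that any caller that reads the view while it is still being built would see constituents out of order, which is presumably why the sort was placed inside the add in the first place.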