GateNLP / gate-core

The GATE Embedded core API and GATE Developer application
GNU Lesser General Public License v3.0

AnnotationSetImpl.remove() leaves nodesByOffset inconsistent #87

Closed greenwoodma closed 5 years ago

greenwoodma commented 5 years ago

If you call AnnotationSetImpl.remove(annotation) after the nodesByOffset map has been built by AnnotationSetImpl.indexByStartOffset(), the annotation is removed from the set but nodesByOffset isn't recomputed.

Here is an example of where this is a problem. Let's assume you are processing tweets and have found an @mention. You then decide to simplify the Token annotations so that, instead of potentially many tokens (@mentions can contain numbers and underscores, which would be separate tokens), there are just two: one over the @ and one over the rest. So you do something like

AnnotationSet tokens = gate.Utils.getContainedAnnotations(inputAS, matchAnnots, "Token");
Annotation first = tokens.get(matchAnnots.firstNode().getOffset()).iterator().next();
tokens.remove(first);

which should

  1. get all the Token annotations within the matchAnnots (which I'm assuming is the UserMention annotation)
  2. get the Token annotation that starts at the offset of the beginning of matchAnnots, i.e. the Token spanning the @
  3. remove the Token over the @ from the annotation set (but not the document)

This does work as expected in that between the first and last line of code the tokens annotation set shrinks by one annotation. The problem is that if you then do

boolean aligned = tokens.firstNode().getOffset().equals(matchAnnots.firstNode().getOffset());

You'll find that aligned is true: the nodesByOffset map that powers firstNode wasn't updated when the annotation was removed, so it still points to the node before the @ rather than the node after it, which is where the earliest remaining annotation in the tokens annotation set begins.

Depending on what your code does next, this may or may not be a problem, but if you make any use of firstNode then you'll get the wrong result. Similarly, removing the last annotation from a set would have a similar effect on the result of lastNode.
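
To make the failure mode concrete without needing GATE on the classpath, here is a minimal, self-contained sketch of a set that lazily builds an offset index and then forgets to update it on removal. Everything in it (StaleIndexDemo, Span, SpanSet, startsByOffset, the maintainIndex flag) is a hypothetical stand-in invented for illustration. It is not the AnnotationSetImpl code, it only tracks start offsets, and the "maintained" branch is just one possible way of keeping such an index consistent, not necessarily how this issue was fixed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model of a lazily built offset index; NOT GATE's real implementation. */
class StaleIndexDemo {

    /** Hypothetical annotation: only a start and an end offset. */
    static final class Span {
        final long start, end;
        Span(long start, long end) { this.start = start; this.end = end; }
    }

    /** Hypothetical set whose index plays the role of nodesByOffset. */
    static class SpanSet {
        private final List<Span> spans = new ArrayList<>();
        // start offset -> number of spans starting there; null until first needed
        private TreeMap<Long, Integer> startsByOffset;
        private final boolean maintainIndex;

        SpanSet(boolean maintainIndex) { this.maintainIndex = maintainIndex; }

        void add(Span s) {
            spans.add(s);
            if (startsByOffset != null) startsByOffset.merge(s.start, 1, Integer::sum);
        }

        /** Builds the index once, on demand. */
        private void indexByStartOffset() {
            if (startsByOffset != null) return;
            startsByOffset = new TreeMap<>();
            for (Span s : spans) startsByOffset.merge(s.start, 1, Integer::sum);
        }

        /** Smallest start offset according to the index (assumes a non-empty set). */
        long firstOffset() {
            indexByStartOffset();
            return startsByOffset.firstKey();
        }

        boolean remove(Span s) {
            boolean removed = spans.remove(s);
            // With maintainIndex == false the index is left untouched, as in the report.
            if (removed && maintainIndex && startsByOffset != null) {
                // Drop the offset entry once no remaining span starts there.
                startsByOffset.computeIfPresent(s.start, (k, n) -> n > 1 ? n - 1 : null);
            }
            return removed;
        }
    }

    /** Adds "@" and "user" tokens, forces the index, removes "@", then asks for the first offset. */
    static long firstOffsetAfterRemovingAt(boolean maintainIndex) {
        SpanSet tokens = new SpanSet(maintainIndex);
        Span at = new Span(10, 11);       // token over the "@"
        tokens.add(at);
        tokens.add(new Span(11, 15));     // token over the rest of the mention
        tokens.firstOffset();             // index is built here
        tokens.remove(at);
        return tokens.firstOffset();
    }

    public static void main(String[] args) {
        System.out.println("stale index:      " + firstOffsetAfterRemovingAt(false)); // 10
        System.out.println("maintained index: " + firstOffsetAfterRemovingAt(true));  // 11
    }
}
```

In the stale case the set no longer contains anything starting at offset 10, yet the index still reports 10, which mirrors aligned coming out true above. Until the index is kept in sync, a caller who needs the true first offset could compute the minimum start offset from the remaining annotations directly rather than relying on firstNode.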