GateNLP / gate-core

The GATE Embedded core API and GATE Developer application
GNU Lesser General Public License v3.0

RepositioningInfo.getOriginalPos fails at end of last segment (was A bug for Json Exporter (HTML escaped characters problem)) #79

Closed. seasonlaw closed this issue 5 years ago

seasonlaw commented 5 years ago

Hi, I have used this plugin to process tweets and found a small bug when exporting annotations to a JSON file. The exporter does not compute the indices of the last token of a tweet properly when the tweet ends with an HTML-escaped character. For example, when I exported the tweet "Y NACI PARA ALENTARTE Y SEGUIRTE A TODAS PARTES [ E &gt;", the last token was annotated as Punctuation in my pipeline, but in the exported JSON file the data is {"indices":[52,-1],"string":">"}. The end offset is miscalculated. The same thing happens when other escaped characters such as "&amp;" or "&lt;" appear at the end of a tweet.

ianroberts commented 5 years ago

This is actually a bug in RepositioningInfo in gate-core, which is the class we use to track the mapping between the unescaped annotation offsets in the Document and the escaped "indices" in the JSON.
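
To make the failure mode concrete, here is a minimal, self-contained sketch of a segment-based offset map. It is not the gate-core `RepositioningInfo` code: the class `OffsetMapSketch` and all of its methods are hypothetical names invented for illustration, and the half-open lookup is only an assumption about how an end offset on the boundary of the last segment could come back as -1, as the issue title suggests. The example maps the escaped text `a &gt;` to the unescaped text `a >`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an offset map; NOT the gate-core RepositioningInfo implementation.
final class OffsetMapSketch {

    // One segment: a run of original (escaped) text and the extracted (unescaped) run it became.
    private static final class Segment {
        final long origStart, origLen, extStart, extLen;

        Segment(long origStart, long origLen, long extStart, long extLen) {
            this.origStart = origStart;
            this.origLen = origLen;
            this.extStart = extStart;
            this.extLen = extLen;
        }
    }

    private final List<Segment> segments = new ArrayList<>();

    void addSegment(long origStart, long origLen, long extStart, long extLen) {
        segments.add(new Segment(origStart, origLen, extStart, extLen));
    }

    // Half-open lookup: the offset must fall strictly inside a segment.
    // An annotation END offset equal to the end of the last segment matches nothing.
    long originalPosHalfOpen(long extPos) {
        for (Segment s : segments) {
            if (extPos >= s.extStart && extPos < s.extStart + s.extLen) {
                long delta = extPos - s.extStart;
                // Equal-length runs map character by character; a replaced run
                // (such as an entity) maps to the start of the original run.
                return s.origLen == s.extLen ? s.origStart + delta : s.origStart;
            }
        }
        return -1;
    }

    // Same lookup, but an offset sitting exactly at the end of the last segment
    // maps to the end of the original text.
    long originalPosInclusiveEnd(long extPos) {
        long found = originalPosHalfOpen(extPos);
        if (found != -1 || segments.isEmpty()) {
            return found;
        }
        Segment last = segments.get(segments.size() - 1);
        if (extPos == last.extStart + last.extLen) {
            return last.origStart + last.origLen;
        }
        return -1;
    }

    // Escaped "a &gt;" (6 chars) becomes unescaped "a >" (3 chars).
    static OffsetMapSketch example() {
        OffsetMapSketch map = new OffsetMapSketch();
        map.addSegment(0, 2, 0, 2); // "a "   -> "a "
        map.addSegment(2, 4, 2, 1); // "&gt;" -> ">"
        return map;
    }

    public static void main(String[] args) {
        OffsetMapSketch map = example();
        // The annotation on ">" spans unescaped offsets 2..3.
        System.out.println("[" + map.originalPosHalfOpen(2) + ", " + map.originalPosHalfOpen(3) + "]");
        System.out.println("[" + map.originalPosInclusiveEnd(2) + ", " + map.originalPosInclusiveEnd(3) + "]");
    }
}
```

In this sketch the half-open lookup yields `[2, -1]` for the final token, the same shape as the `[52,-1]` reported above, while the variant that accepts the end of the last segment yields `[2, 6]`.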

ianroberts commented 5 years ago

That said, the format-twitter plugin is supposed to be compatible with gate-core 8.5 so I've also added a workaround (https://github.com/GateNLP/gateplugin-Format_Twitter/commit/387487bb351e283205eb7826057923517502a609) for this bug in the plugin (version 8.6-SNAPSHOT) so it will work with earlier versions of GATE.