Closed seasonlaw closed 5 years ago
This is actually a bug in RepositioningInfo in gate-core, which is the class we use to track the mapping between the unescaped annotation offsets in the Document and the escaped "indices" in the JSON.
That said, the format-twitter plugin is supposed to be compatible with gate-core 8.5, so I've also added a workaround (https://github.com/GateNLP/gateplugin-Format_Twitter/commit/387487bb351e283205eb7826057923517502a609) for this bug in the plugin (version 8.6-SNAPSHOT), which means it will also work with earlier versions of GATE.
Hi, I have used this plugin to process tweets and found a small bug when exporting annotations to a JSON file. The exporter does not compute the indices of the last token of a tweet properly when the tweet ends with an HTML-escaped character. For example, for the tweet "Y NACI PARA ALENTARTE Y SEGUIRTE A TODAS PARTES [ E >", the last token is annotated as Punctuation in my pipeline, but in the exported JSON file the data is {"indices":[52,-1],"string":">"}. The end offset is miscalculated. The same happens when other escaped characters such as "&" or "<" appear at the end of a tweet.
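To make the expected result concrete, here is a minimal sketch of the offset mapping for the tweet in the report above. It is not the RepositioningInfo implementation or the plugin's workaround. The class `EscapedOffsets` and its methods are hypothetical names, and the sketch assumes that only `&`, `<` and `>` are escaped (as `&amp;`, `&lt;` and `&gt;`) and that offsets are counted in Java chars, ignoring surrogate pairs.

```java
// Hypothetical illustration only; not part of gate-core or the Format_Twitter plugin.
final class EscapedOffsets {

    // Length of the HTML-escaped form of a single character (assumed escape set).
    static int escapedLength(char c) {
        switch (c) {
            case '&': return 5; // &amp;
            case '<': return 4; // &lt;
            case '>': return 4; // &gt;
            default:  return 1;
        }
    }

    // Maps an offset in the unescaped text to the corresponding index in the
    // escaped text. Offsets from 0 to text.length() inclusive are accepted, so
    // an annotation that ends exactly at the end of the text still gets an index.
    static int toEscapedIndex(String unescaped, int offset) {
        if (offset < 0 || offset > unescaped.length()) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        int index = 0;
        for (int i = 0; i < offset; i++) {
            index += escapedLength(unescaped.charAt(i));
        }
        return index;
    }

    public static void main(String[] args) {
        String tweet = "Y NACI PARA ALENTARTE Y SEGUIRTE A TODAS PARTES [ E >";
        int start = tweet.length() - 1; // offset of the final ">" token
        int end = tweet.length();       // end offset coincides with the end of the text
        // Prints [52,56]: the escaped "&gt;" occupies four characters.
        System.out.println("[" + toEscapedIndex(tweet, start) + ","
                + toEscapedIndex(tweet, end) + "]");
    }
}
```

Under these assumptions the final token would be expected to export as `[52,56]` rather than `[52,-1]`. The point the sketch makes is that the end-of-text offset has to be treated as a valid position in the mapping.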