GateNLP / gateplugin-Twitter

A suite of tools designed for processing Tweets
GNU Lesser General Public License v3.0

XGAPP Inconsistencies #4

Closed greenwoodma closed 5 years ago

greenwoodma commented 5 years ago

In the normal version of TwitIE we use a standard gazetteer to produce lookups as input to the NE grammars. This means that, as long as we run it before the NE grammar, it can go anywhere in the pipeline. We make use of this fact by putting it before the hashtag tokenizer so that we can use the lookups when tokenizing (or choosing not to tokenize) hashtags.

In the English-only version (i.e. it only runs things if the tweet is in English), for some reason we have the gazetteer wrapped inside a flexible gazetteer. This allows the gazetteer to work on the string features of tokens rather than the original tweet text. I assume this is so that tokens that have been normalized can be matched. The problem is that this means you need to have done tokenization before running the gazetteer, which in turn means you can't use the gazetteer as input to the hashtag tokenization.
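To make the ordering constraint concrete, here is a rough toy sketch in Python. It is not the GATE API: the function names, the `norm` feature, the tiny word list and the normalization table are all made up for illustration, and it assumes my guess above (that the wrapper exists to match normalized tokens) is right.

```python
import re

# Hypothetical gazetteer entries and normalization table (illustration only).
GAZETTEER = {"london", "tomorrow"}
NORMALIZATION = {"2moro": "tomorrow"}


def tokenize(text):
    """Toy whitespace tokenizer: one dict per token with its original string."""
    return [{"string": m.group()} for m in re.finditer(r"\S+", text)]


def normalize(tokens):
    """Add a 'norm' feature: lowercase, then map known variants to a canonical form."""
    for tok in tokens:
        lowered = tok["string"].lower()
        tok["norm"] = NORMALIZATION.get(lowered, lowered)
    return tokens


def plain_gazetteer(text):
    """Match entries against the raw text. Needs no tokens, so it can run
    anywhere before the NE grammar, including before hashtag tokenization."""
    lowered = text.lower()
    return sorted(
        entry for entry in GAZETTEER
        if re.search(r"\b" + re.escape(entry) + r"\b", lowered)
    )


def flexible_gazetteer(tokens, feature="norm"):
    """Match entries against a token feature. Needs tokenization (and
    normalization) to have run first, so it cannot feed the tokenizer."""
    return sorted(tok[feature] for tok in tokens if tok[feature] in GAZETTEER)


tweet = "see u in London 2moro"
print(plain_gazetteer(tweet))                          # raw text only
print(flexible_gazetteer(normalize(tokenize(tweet))))  # token features
```

The plain version only finds "london", while the token-based version also finds "tomorrow" via the normalized form, but only because tokenization has already happened by the time it runs.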

To be honest, I'm not sure which of these approaches makes more sense, but it's odd that the two apps are so different when one is simply meant to be a copy of the other with the language-conditional feature turned on for most of the PRs. I think we should aim for consistency, but which version is correct/better?