GateNLP / gateplugin-Twitter

A suite of tools designed for processing Tweets
GNU Lesser General Public License v3.0

Should we assume one Tweet per document? #5

Closed: greenwoodma closed this issue 5 years ago

greenwoodma commented 5 years ago

Another inconsistency between the two versions of TwitIE is that the main version puts the detected language in a feature on the Tweet annotation. This means that if we don't have a Tweet annotation the language gets lost (which explains why the cloud app adds the annotation if it doesn't exist). The English-only app, however, puts the lang feature on the document so that the conditional pipeline can use it to turn off further processing.

The outcome of this is that the main app can support processing multiple separate tweets inside a given GATE document, and they will be treated independently (at least for the purpose of lang ID), whereas the English-only app treats the entire GATE document as a single tweet.
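To make the difference concrete, here is a minimal sketch of the two behaviours described above. It does not use the GATE API: `Document`, `Annotation`, `toy_detect`, `tag_per_tweet`, `tag_per_document` and `should_run_english_pipeline` are all hypothetical stand-ins made up for illustration, and the real apps may differ in detail.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Annotation:
    """Hypothetical stand-in for an annotation: a typed span with features."""
    type: str
    start: int
    end: int
    features: Dict[str, str] = field(default_factory=dict)


@dataclass
class Document:
    """Hypothetical stand-in for a document: text, features, annotations."""
    text: str
    features: Dict[str, str] = field(default_factory=dict)
    annotations: List[Annotation] = field(default_factory=list)


def toy_detect(text: str) -> str:
    """Toy language detector (an assumption, not the TwitIE lang ID)."""
    return "fr" if "bonjour" in text.lower() else "en"


def tag_per_tweet(doc: Document, detect: Callable[[str], str] = toy_detect) -> None:
    """Main-app style: store lang on each Tweet annotation.

    If the document has no Tweet annotation, nothing is stored at all.
    """
    for ann in doc.annotations:
        if ann.type == "Tweet":
            ann.features["lang"] = detect(doc.text[ann.start:ann.end])


def tag_per_document(doc: Document, detect: Callable[[str], str] = toy_detect) -> None:
    """English-only style: treat the whole document as one tweet."""
    doc.features["lang"] = detect(doc.text)


def should_run_english_pipeline(doc: Document) -> bool:
    """Sketch of the check a conditional pipeline could make on the document."""
    return doc.features.get("lang") == "en"
```

With two Tweet annotations in one document, `tag_per_tweet` gives each its own `lang` and leaves the document features untouched, while `tag_per_document` produces a single value for the whole text. Calling `tag_per_tweet` on a document with no Tweet annotation stores nothing, which is the "language gets lost" case.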

Should both apps behave in the same way? My feeling is that they should both assume one tweet per document, but I'm not sure how others would feel about that. If we do go with one tweet per document then we can do away with creating the Tweet annotation in the cloud app as it's no longer needed (although in this case the language would still get lost in the "test the pipeline" view as we don't show document features).

ianroberts commented 5 years ago

It would be interesting to know how many users outside of Sheffield actually use the facility to have more than one tweet in a single GATE document. I know we rarely if ever do this apart from possibly in the earliest stages of building a new application (though even then, one tweet per doc makes for more meaningful stats on corpus QA) - all our "production" systems use GCP and do everything as one tweet per doc.