Annotation instructions for CPH data

nschneid commented 9 years ago

What were annotators told for the CPH supersense-labeled tweets? I need to give my annotator a policy for various tweet-centric phenomena.

Differences I am noticing between Lowlands and Ritter datasets:

- Lowlands consists mostly of promotional material/shares of other content. Most tweets are short, headline-like, and end with a URL. Ritter data is more conversational.
- URLs, usernames, numbers, etc. are obscured in Lowlands data, but not Ritter data.
- In Lowlands data, numbers at the beginning of a noun phrase are joined to form an MWE.
- In Lowlands data, usernames are supersense-tagged as PERSON and URLs as COMMUNICATION. They are not supersense-tagged in Ritter. I would argue that it's probably not the best use of annotator time to assign labels that can be applied deterministically, nor is it interesting to give credit for such labels in our evaluation.
- Some Ritter tokens are inexplicably missing supersense labels. E.g. in ritter-train.tsv:
```
follow  NN  O
back    NN  O
```
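A quick way to list such tokens for review might look like the sketch below. It assumes the three whitespace-separated columns shown above (token, POS, label) and treats any noun tagged `O` as a candidate; the function name `find_unlabeled` and the sample label `B-noun.animal` are made up for illustration, and some `O` nouns may well be intentional.

```python
def find_unlabeled(lines, pos_prefixes=("NN",)):
    """Return (line_number, token, pos) for tokens whose POS starts with one
    of pos_prefixes but whose label column is O."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank separator lines and anything malformed
        token, pos, label = parts
        if label == "O" and pos.startswith(pos_prefixes):
            hits.append((lineno, token, pos))
    return hits


# Illustrative rows only; the last label is a hypothetical example.
sample = ["follow\tNN\tO", "back\tNN\tO", "", "dog\tNN\tB-noun.animal"]
unlabeled = find_unlabeled(sample)
```

Passing `open("ritter-train.tsv", encoding="utf-8")` in place of `sample` would scan the whole file, assuming it follows the same layout throughout.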

dirkhovy commented 9 years ago

> What were annotators told for the CPH supersense-labeled tweets? I need to give my annotator a policy for various tweet-centric phenomena.

There was very little supervision, since another goal was to explore how much overlap we can get out of annotators without specific training. The only instruction they received was to label all social media as noun.communication.

> Differences I am noticing between Lowlands and Ritter datasets:
>
> Lowlands consists mostly of promotional material/shares of other content. Most tweets are short, headline-like, and end with a URL. Ritter data is more conversational. URLs, usernames, numbers, etc. are obscured in Lowlands data, but not Ritter data. In Lowlands data, usernames are supersense-tagged as PERSON and URLs as COMMUNICATION. They are not supersense-tagged in Ritter. I would argue that it's probably not the best use of annotator time to assign labels that can be applied deterministically, nor is it interesting to give credit for such labels in our evaluation.

Yes, we should map them deterministically. These differences are a constant source of trouble; we even wrote a whole paper about the need to sync all preprocessing, and how there's still stuff left.
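One way to avoid giving credit for deterministically assignable labels is to drop those tokens before scoring. The sketch below is only an illustration: it uses plain token accuracy as a stand-in for whatever metric the evaluation ends up using, the regular expressions match raw URLs and @-usernames (obscured placeholders would need their own patterns), and the helper names and sample labels are hypothetical.

```python
import re

# Assumed surface patterns for raw (non-obscured) tokens.
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
USER_RE = re.compile(r"^@\w+$")


def is_deterministic(token):
    """True for tokens whose label could be assigned by rule."""
    return bool(URL_RE.match(token) or USER_RE.match(token))


def accuracy_excluding_deterministic(tokens, gold, pred):
    """Token accuracy over everything except URLs and usernames."""
    kept = [(g, p) for t, g, p in zip(tokens, gold, pred)
            if not is_deterministic(t)]
    if not kept:
        return 0.0
    return sum(g == p for g, p in kept) / len(kept)


# Illustrative data only.
tokens = ["@user", "shared", "a", "link", "http://t.co/x"]
gold = ["O", "verb.possession", "O", "noun.communication", "O"]
pred = ["noun.person", "verb.possession", "O", "O", "noun.communication"]
score = accuracy_excluding_deterministic(tokens, gold, pred)
```

Here the username and the URL are ignored, so the score is computed over the three remaining tokens.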

nschneid commented 9 years ago

Another thing: In Ritter, URLs are POS-tagged as X; in Lowlands they are PROPN. X is the official answer because it covers ADD = email or web address. @dirkhovy, do you want to update the original Lowlands data? Or should I make the update when generating the data for annotation, and we'll release it in the updated training data once annotation is done?

andersjo commented 9 years ago

Dirk changes URLs from PROPN to X in the Lowlands data

nschneid commented 9 years ago

URLs mapped to POS = X and no supersense in a1ce9a7 – 0269fe8
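For anyone applying the same policy to other data, a minimal sketch follows. It is not the code from the commits referenced above: the URL pattern, the `(token, pos, supersense)` row layout, the use of an empty string for "no supersense", and the name `normalize_url_rows` are all assumptions for illustration.

```python
import re

# Assumed pattern for raw URLs; obscured placeholders would need their own rule.
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)


def normalize_url_rows(rows):
    """Set POS to X and clear the supersense for every URL token."""
    out = []
    for token, pos, supersense in rows:
        if URL_RE.match(token):
            pos, supersense = "X", ""
        out.append((token, pos, supersense))
    return out


# Illustrative rows only.
rows = [
    ("read", "VERB", "verb.cognition"),
    ("http://t.co/abc", "PROPN", "noun.communication"),
]
normalized = normalize_url_rows(rows)
```

Non-URL rows pass through unchanged, so the function can be run over a whole file without touching other annotations.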

dimsum16 / dimsum-data

Annotation instructions for CPH data #3