UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

Standard abbreviations/acronyms #69

Open nschneid opened 5 years ago

nschneid commented 5 years ago

Ideally we would systematically mark these with the feature Abbr=Yes. Currently this feature is mainly being used for colloquial shortenings ("ppl", "prolly").
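To see how the feature is currently distributed, one could scan the CoNLL-U files for tokens whose FEATS column contains Abbr=Yes. The following is only a minimal sketch: the helper names `parse_feats` and `abbreviated_tokens` are hypothetical, and the sample rows (including their lemmas and features) are made up for illustration rather than copied from the treebank.

```python
def parse_feats(feats):
    """Turn a CoNLL-U FEATS string like 'Abbr=Yes|Number=Plur' into a dict."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def abbreviated_tokens(conllu_text):
    """Yield (form, lemma, upos) for each token whose FEATS include Abbr=Yes."""
    for line in conllu_text.splitlines():
        # Skip blank sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed 10-column token line
        form, lemma, upos, feats = cols[1], cols[2], cols[3], cols[5]
        if parse_feats(feats).get("Abbr") == "Yes":
            yield form, lemma, upos


# Hypothetical sample: illustrative annotation only, not actual EWT data.
SAMPLE = "\n".join([
    "# text = ppl need info",
    "\t".join(["1", "ppl", "people", "NOUN", "NNS",
               "Abbr=Yes|Number=Plur", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "need", "need", "VERB", "VBP",
               "Mood=Ind|Tense=Pres|VerbForm=Fin", "0", "root", "_", "_"]),
    "\t".join(["3", "info", "info", "NOUN", "NN",
               "Number=Sing", "2", "obj", "_", "_"]),
])

for form, lemma, upos in abbreviated_tokens(SAMPLE):
    print(form, lemma, upos)
```

In this made-up sample only "ppl" is reported, because "info" carries no Abbr=Yes; that mirrors the kind of gap described above.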

Should the lemma spell out the word, e.g. to disambiguate "St." as "Street" vs. "Saint"? What if it's an abbreviation of multiple words ("OMFG")?

nschneid commented 5 years ago

Also conventionalized colloquial truncations of words, like "info" for "information", "meds" for "medications", "limo" for "limousine", "fab" for "fabulous", and "physio" for "physiotherapist".

amir-zeldes commented 5 years ago

This is a tricky issue, thanks for pointing it out... For comparison, in UD_English-GUM the lemmas do standardize across clear errors, but not abbreviations. One of the main criteria we use is "if the writer had been made aware of the issue, would they have spelled it differently?". Here are some cases where we answered yes:

Items like 'physio' would probably be left alone in GUM as a kind of synonym (essentially the idea is that the writer can choose between the lexical item physio and physiotherapy). One argument for this may be independent morphology: I'm not sure about 'physios', but I think you can definitely say 'limos', and maybe 'fab' is comparable (fabber? more fab?).

We have plenty of multiword abbreviations and we lemmatize them as themselves (OMG stays OMG). POS choice is also tricky there, and we base it on the expanded form's head (e.g. we've tagged CMV for "Change My View" as an imperative verb!)