UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Lemma of abbreviation #516

Open nschneid opened 6 years ago

nschneid commented 6 years ago

Spawned off of #513 and UniversalDependencies/UD_English#40.

I proposed:

Another issue relevant here is abbreviations. For uncommon abbreviations/shortened forms (like w for with, btwn for between, thru for through), I'm inclined to say we should use the canonical spelling in the lemma and apply the feature Abbr=Yes. For common abbreviations like vs. for versus and etc. for et cetera, perhaps we should keep the surface form in the lemma.

There has been further discussion about single-token abbreviations that would expand to multiple words, and whether to expand frequent single-word abbreviations.
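
To make the quoted proposal concrete, here is a minimal sketch of the lemma policy as a lookup. The two word lists and the function name `annotate` are hypothetical and only cover the examples given above. The proposal does not say whether Abbr=Yes should also apply to common abbreviations, so the sketch leaves FEATS empty (`_`) in that case.

```python
# Hypothetical lists for illustration; only the examples from the proposal.
COMMON_ABBREVIATIONS = {"vs.", "etc."}
CANONICAL_SPELLING = {"w": "with", "btwn": "between", "thru": "through"}


def annotate(form):
    """Return (lemma, feats) for a token under the proposed policy."""
    key = form.lower()
    if key in COMMON_ABBREVIATIONS:
        # Common abbreviation: keep the surface form as the lemma.
        # The proposal is silent on FEATS here, so this sketch uses "_".
        return key, "_"
    if key in CANONICAL_SPELLING:
        # Uncommon abbreviation: canonical spelling as lemma, plus Abbr=Yes.
        return CANONICAL_SPELLING[key], "Abbr=Yes"
    # Not a listed abbreviation: this sketch returns the form unchanged.
    return form, "_"


if __name__ == "__main__":
    for token in ["thru", "btwn", "vs.", "etc."]:
        lemma, feats = annotate(token)
        print(f"{token}\t{lemma}\t{feats}")
```

Single-token abbreviations that expand to multiple words are not handled here, since that case is still under discussion.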

dan-zeman commented 6 years ago

Possibly related old issues: https://github.com/UniversalDependencies/docs/issues/112 https://github.com/UniversalDependencies/docs/issues/181

sanjmeh commented 6 years ago

I want to reiterate this problem of tagging short abbreviations. The classic example is the short form vs. (or just v.), used in most legal text instead of the full word versus. When annotating text with udpipe, the english_ewt model keeps the period inside the token (but still isn't able to lemmatize it to VERSUS), while the english_partut model treats the period as a separate token and abruptly ends the sentence. So we have a problem here that may be serious enough for legal text.