UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

Some acronyms are lowercased in the lemmas #422

Open AngledLuffa opened 11 months ago

AngledLuffa commented 11 months ago

There may be some inconsistency in the way lemmas are capitalized. In particular, ID, CD, DM, NGO, and LP are lowercased in their lemma forms throughout the dataset.
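For anyone who wants to check this locally, here is a rough sketch of how such cases could be listed from a CoNLL-U file. It assumes only the standard 10-column CoNLL-U layout (ID, FORM, LEMMA, ...). The function name `lowercased_acronym_lemmas` and the two-sentence-free toy sample are made up for illustration and are not taken from EWT. Note that the all-caps test will also catch shouted ordinary words, so the output would need manual review.

```python
from collections import Counter


def lowercased_acronym_lemmas(conllu_text):
    """Count (FORM, LEMMA) pairs where an all-caps FORM has an all-lowercase LEMMA.

    Hypothetical helper: a heuristic filter, not an official UD validation rule.
    """
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip blank sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        tok_id, form, lemma = cols[0], cols[1], cols[2]
        # Skip multiword-token ranges (e.g. "3-4") and empty nodes (e.g. "5.1").
        if "-" in tok_id or "." in tok_id:
            continue
        # Require at least two letters so that the pronoun "I" is not flagged.
        if len(form) >= 2 and form.isalpha() and form.isupper() and lemma == form.lower():
            counts[(form, lemma)] += 1
    return counts


# Toy input invented for this example (not real EWT annotation).
sample = "\n".join([
    "# text = I lost my ID and a CD",
    "1\tI\tI\tPRON\tPRP\t_\t2\tnsubj\t_\t_",
    "2\tlost\tlose\tVERB\tVBD\t_\t0\troot\t_\t_",
    "3\tmy\tmy\tPRON\tPRP$\t_\t4\tnmod:poss\t_\t_",
    "4\tID\tid\tNOUN\tNN\t_\t2\tobj\t_\t_",
    "5\tand\tand\tCCONJ\tCC\t_\t7\tcc\t_\t_",
    "6\ta\ta\tDET\tDT\t_\t7\tdet\t_\t_",
    "7\tCD\tcd\tNOUN\tNN\t_\t4\tconj\t_\t_",
])

for (form, lemma), n in sorted(lowercased_acronym_lemmas(sample).items()):
    print(form, lemma, n)
```

On a real treebank file, the text would be read with `open(path, encoding="utf-8").read()` and passed to the same function.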

nschneid commented 11 months ago

Yes, to lay out the issue in more detail:

Most of the lemma-capitalization policies in #131 target names. But #421 raised the issue of all-caps non-name acronyms like "CD" (compact disc) whose lemmas have been lowercased in EWT. Wouldn't it be better if (a) the lemma of "CD" retained capitalization, and (b) occurrences of "cd" were normalized to "CD" in the lemma?

These standardly capitalized ones may be distinguished from ad hoc or phrasal ones like "OMG" (raised in #131). (Do we want to normalize "omg" to "OMG"?)
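As a concrete illustration of (a) and (b), the normalization could look roughly like the sketch below. The allowlist `ACRONYMS` and the function `normalize_acronym_lemma` are hypothetical names introduced here; the list contains only the acronyms mentioned in this issue, and whether "OMG" belongs in it is exactly the open question. A plain string match like this cannot tell the acronym "ID" from the psychoanalytic "id", so a real pass would likely need to consult UPOS or context as well.

```python
# Hypothetical allowlist: the acronyms named in this issue, nothing more.
ACRONYMS = frozenset({"ID", "CD", "DM", "NGO", "LP"})


def normalize_acronym_lemma(lemma, acronyms=ACRONYMS):
    """Return the uppercase lemma if it matches a listed acronym, else the lemma unchanged.

    Keying on the lemma rather than the form means a plural such as "CDs"
    (lemma "cd") is handled the same way as the singular.
    """
    upper = lemma.upper()
    return upper if upper in acronyms else lemma


# (a) an already-capitalized lemma stays capitalized; (b) "cd" becomes "CD".
for lemma in ["CD", "cd", "ngo", "disk", "omg"]:
    print(lemma, "->", normalize_acronym_lemma(lemma))
```

Adding "OMG" to the allowlist would be a one-line change if the decision goes that way.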

amir-zeldes commented 11 months ago

+1 for standardizing acronyms to a single form lemma (I'm fine with OMG upper, but either way it's the same lexical item in both cases, so should be a single lemma if possible)