UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

Documentation of lemmatization decisions in English corpora #131

Open nschneid opened 3 years ago

nschneid commented 3 years ago

We should write a summary if this does not already exist, since https://universaldependencies.org/u/overview/morphology.html#lemmas says 'treebanks have considerable leeway in interpreting what “canonical or base form” means'.

For example, e9884f8 removed plural endings from lemmas in proper names: knight~s~ inn and kc chief~s~. Was this intentional? (@jheinecke)

It also removed plural endings from pluralized ethnonyms like Caucasians, which was intentional. These are also tagged PROPN.

I know other edge cases with lemmas arise from time to time.

nschneid commented 3 years ago

It seems EWT does not lowercase capitalized forms for proper noun lemmas. Another question is whether lowercase forms that are conventionally capitalized should have capital lemmas, and if so whether this counts as a typo or abbreviation.

Other issues:

#113 - numbers

#99 - quotation marks

#69 - abbreviations

amir-zeldes commented 3 years ago

FYI in GUM we also singularize, but don't decapitalize NNPS lemmas, a practice I think we took over from the behavior of TreeTagger in the distant past.

On some level, I think it makes sense that the lemma of "Investigations" in "Federal Bureau of Investigations" is "Investigation", since it's a compositional, transparent plural, and for other NNPS that are opaque names we generally want to keep caps (e.g. "Sony" is the lemma of "Sony"). So taking these two desiderata together (if we agree they're desirable) would seem to lead to the practice you're describing, right?

nschneid commented 3 years ago

Spelling out details of capitalization policies:

Regarding inflectional morphology in lemmas within proper names:

Lemmas of pronouns:

Documented at https://universaldependencies.org/en/pos/PRON.html

Lemmas of numbers:

Lemmas and spelling variation:

PS the actual expansion of FBI is singular Investigation, but that's not really relevant here. :)

amir-zeldes commented 3 years ago

Oh, thanks, I learned something there :)

Note that some of these behaviors are not shared with GUM:

arademaker commented 3 years ago

some inconsistencies in the lemmatization of PRON?

ar@leme ud-english-ewt % awk '$4 ~ /PRON/ {print $2,$3,$4}' *.conllu | sort | uniq -c | sort -nr
...
 390 our we PRON
   4 ours ours PRON
...
 758 your you PRON
   1 your your PRON

kanayamah commented 3 years ago

@arademaker mine, yours and theirs are kept as they are, so should ours.

AngledLuffa commented 3 years ago

Any decision on capitalizing nationalities etc. used as adjectives? American, for example. I could put that together as a pull request too.

nschneid commented 3 years ago

If we have a scalable way to distinguish proper adjectives from other adjectives, IMO it would be good to treat them similarly to proper nouns, which would preserve the initial capital if it is present in the wordform. So this would update the second bullet in my list above.

AFAIK the lemmas do not actively capitalize proper nouns that have been spelled without capitalization.

AngledLuffa commented 3 years ago

I think it just needs to be done once for ewt, maybe once for some of the other treebanks. Anyway, the scaling would be by making a long list of country names and grepping for them.
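
The list-and-grep approach could look roughly like the following. This is only a sketch under stated assumptions: `DEMONYMS` is a hypothetical, deliberately tiny word list, the function name is made up for illustration, and the input is assumed to be 10-column, tab-separated CoNLL-U.

```python
# Hypothetical, deliberately tiny word list; a real one would be much longer.
DEMONYMS = {"american", "indian", "jamaican", "norwegian"}


def find_lowercase_demonym_lemmas(conllu_text, demonyms=DEMONYMS):
    """Return (line_number, form, lemma) for listed ADJ tokens with a lowercase lemma."""
    hits = []
    for lineno, line in enumerate(conllu_text.splitlines(), start=1):
        cols = line.split("\t")
        # Skip comments, blank lines, and anything that is not a token line.
        if line.startswith("#") or len(cols) != 10:
            continue
        form, lemma, upos = cols[1], cols[2], cols[3]
        if upos == "ADJ" and form.lower() in demonyms and lemma[:1].islower():
            hits.append((lineno, form, lemma))
    return hits
```

The hits would still need manual review, since a word list cannot settle cases where the capitalization itself is in question.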

amir-zeldes commented 3 years ago

In GUM, lower case proper names are upper cased in the lemma if the name is usually spelled capitalized (example: sony -> lemma=Sony). Nationalities are also capitalized in adjective lemmas.

To find a good list of these, you can also check all NORP entities in the enamex layer of OntoNotes, which mostly correspond to demonym adjectives.
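
As a rough illustration of that practice, here is a minimal sketch, assuming a lookup table of conventional spellings is available. `CONVENTIONAL` and `normalize_lemma_case` are hypothetical names, not part of GUM or any existing tool, and the two entries are examples only.

```python
# Hypothetical lookup from lowercased name to its conventional spelling.
CONVENTIONAL = {"sony": "Sony", "american": "American"}


def normalize_lemma_case(lemma, conventional=CONVENTIONAL):
    """Return the conventional spelling if the lemma is listed, else the lemma unchanged."""
    return conventional.get(lemma.lower(), lemma)
```

Words that are not in the table are left alone, so the quality of the result depends entirely on the list.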

AngledLuffa commented 3 years ago

  • African-American - capital, capital?
  • African ringneck - capital, lowercase?
  • Alexandrine parrot - ???
  • Indian capital Delhi - capital, lowercase, capital?
  • Kolkata is an Indian state - capital, ..., capital, lowercase? and if it's misspelled kolkatta, it gets turned into Kolkata?
  • the only West Indian spot - capital, capital? or is West lowercase?
  • a huge West Indian population - same question, I guess
  • American palette - capital, lowercase?
  • authentic Jamaican food - lowercase, capital, lowercase?

nschneid commented 3 years ago

Those all sound good to me. I would include terms like West (Indian) or Mr. (Smith) as part of the name and therefore capitalized.

AngledLuffa commented 3 years ago

Not entirely sure why I took this on. Must be a chip on my shoulder for being the Nth EWT annotator out of N. It's pretty tedious though.

More possible ambiguities:

  • Aryan, Masonic - capitalized?
  • air asia -> Air Asia
  • southeast asian -> Southeast Asian
  • central asian republics -> Central Asian republic
  • irish coffee -> Irish coffee
  • canan t3i (a product name, misspelled)
  • U$ -> US
  • Panamal Canal -> Panama Canal (is currently Panamal Canal)
  • Suez canal -> Suez Canal
  • Soul food? weirdly capitalized as Soul Food, but lemma should still be "soul food"
  • christmas? Goes to Christmas?

AngledLuffa commented 3 years ago

also, satanic & satanism?

amir-zeldes commented 3 years ago

These all look good to me. FWIW Wikipedia capitalizes Satanism and Satanic: https://en.wikipedia.org/wiki/Satanism

Here's another one for discussion: what should we do about "eldest":

It's tagged JJS and looks pretty compositional, but obviously there is also "oldest"...

nschneid commented 3 years ago

eldest: lemma=old, OldFashioned=Yes, Human=Yes ;)

amir-zeldes commented 3 years ago

OK, going with old. This all makes lemmatization a much more interesting layer to predict/use as a feature!

AngledLuffa commented 3 years ago

There is a word "elder", though, which I wouldn't consider a form of "old". "The village elder". Maybe it should also be "old", though

nschneid commented 3 years ago

AngledLuffa commented 3 years ago

Running into some alternate word forms and abbreviations

  • christmas -> Christmas
  • easter -> Easter
  • halloween -> Halloween
  • christ as a swear word... Christ?
  • anti-american -> anti-American ?
  • latin - american -> Latin - American ?

  • Antichrist or antichrist?
  • company name such as American Pride Irrigation & Landscaping - all capitals?
  • u.k goes to...?
  • rep.ireland ...?

  • Calif abbreviation kept as Calif ?
  • publication title such as The American Conservative ... The and Conservative?

nschneid commented 3 years ago

For general vocabulary used in names (titles, companies, etc.), the common/proper distinction is tricky. @amir-zeldes what does GUM do for lemma capitalization in such names?

nschneid commented 3 years ago

My gut feeling is that capitalization in the lemma shouldn't be too context-specific—so "the" should always have a lowercase lemma even in names. But I don't know what the simplest thing is in practice.

AngledLuffa commented 3 years ago

A bunch more possible edge cases in country names:

  • french fries ... French?
  • german shepherd ... German shepherd?
  • Indian ringneck ... Indian?
  • Guinea pig?
  • Iranian Navy? capital N or not?
  • Islamism? Islamist?
  • phila pa ... Phila PA?
  • norweigan cruise line -> Norweigan
  • royal carribean -> Royal Carribean ?
  • chicken parmesan -> chicken parmesan ?
  • neoplatonic
  • Shakespearian
  • Siamese lynx
  • spartan conditions
  • pad thai

AngledLuffa commented 3 years ago

The resolution of some of these questions aside, I changed a whole bunch of them in this pull request:

https://github.com/UniversalDependencies/UD_English-EWT/pull/144

nschneid commented 3 years ago

You could check a dictionary or style guide for some of these. I would lowercase "guinea pig" but in most cases I share your intuitions.

amir-zeldes commented 3 years ago

> For general vocabulary used in names (titles, companies, etc.), the common/proper distinction is tricky. @amir-zeldes what does GUM do for lemma capitalization in such names?

The real question is a tagging question IMO: If we go with NNP/PROPN then if the conventional spelling of that name/brand/whatever is capitalized, the lemma will be capitalized as well (and singularized if NPS).

You've inspired me to do some consistency checks on GUM, so I'm seeing we weren't always 100% consistent on this, but that was the intention, and I should be able to fix a bunch of cases.

amir-zeldes commented 3 years ago

@AngledLuffa I would do everything the same as you except:

Note also for spartan, dictionary.com concurs the non-demonym usage is lower-cased, so I suppose it's just a lexicalized adjective and no longer supposed to be referencing Sparta.

Oh, and yes, "elder" can be a noun, also based on the existence of the plural "elders", so that's a separate lemma

AngledLuffa commented 3 years ago

Oops, completely missed the correct spelling of Norwegian. Which brings up another question: Norwegian Forest Cat. Capitalize Norwegian Forest, no forest, or also Cat?

nschneid commented 3 years ago

> • phila -> Philadelphia (probably with Typo=Yes)

Abbr=Yes, I think. Probably not accidentally shorter.

> Oops, completely missed the correct spelling of Norwegian. Which brings up another question: Norwegian Forest Cat. Capitalize Norwegian Forest, no forest, or also Cat?

I'd lean toward the conservative side and just capitalize Norwegian. (Similarly: Labrador retriever, etc.)

AngledLuffa commented 3 years ago

Slight disagree based on the Wikipedia article, but I don't feel strongly:

https://en.wikipedia.org/wiki/Norwegian_Forest_cat

amir-zeldes commented 3 years ago

The Wiki convinced me: I think it's [[Norwegian Forest] cat] (a cat from the Norwegian Forest), rather than a Norwegian variety of [forest cat], so Norwegian Forest could be a place name?

amir-zeldes commented 3 years ago

> publication title such as The American Conservative ... The and Conservative?

Forgot to say about "The" & co - the PTB guidelines tag function words in names normally, so the lemmatization practice derived from that has been to lemmatize them normally. For example:

The/DT/the Lord/NNP/Lord of/IN/of the/DT/the Rings/NNPS/Ring

This is what we do in GUM in any case - only the NNP-tagged words receive capitalized lemmas as relevant.
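
A minimal sketch of this tag-driven rule follows. The function name `name_token_lemma` is hypothetical, the singularization is a naive "strip a final s" stand-in for a real lexicon, and for non-NNP tokens only the casing is handled here, not inflection.

```python
def name_token_lemma(form, xpos):
    """Lemma for a token inside a name, decided only by its PTB tag."""
    if xpos == "NNP":
        # Singular proper noun: keep the form, including its capitalization.
        return form
    if xpos == "NNPS":
        # Plural proper noun: keep capitalization, naively drop a final "s".
        # Irregular plurals would need a lexicon; this is a simplification.
        return form[:-1] if form.endswith("s") else form
    # Everything else (e.g. DT, IN) is lemmatized normally; only the
    # lowercasing part of that is modeled in this sketch.
    return form.lower()


tagged = [("The", "DT"), ("Lord", "NNP"), ("of", "IN"), ("the", "DT"), ("Rings", "NNPS")]
lemmas = [name_token_lemma(form, xpos) for form, xpos in tagged]
```

For the example above, `lemmas` comes out as the, Lord, of, the, Ring, matching the annotation shown.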