UniversalDependencies / UD_English-GUM

Capitalized auxiliary lemmas #48

Closed amir-zeldes closed 2 years ago

amir-zeldes commented 2 years ago

I have an issue with the validator complaining about capitalized auxiliary lemmas, which I think are correctly capitalized and correctly treated as an auxiliary based on current guidelines.

Background: In English lemmatization, we decided that PROPN lemmas retain capitalization, since obviously the lemma of "France" should not be "france". However, due to the complexity of work-of-art titles (discussed in the past at length), we agreed to keep all capitalized lemmas as is, leading to the following lemmas:

When names of things contain non-nominal parts of speech, we agreed to tag them as their normal grammatical POS tags and deprels (so PROPN means something is both a name and a noun), but we still lemmatize capitalized as follows:

This policy has pros and cons but I feel it is fairly well understood and consistent in English by now. However, I now have this sentence:

The sentence is tagged and lemmatized following the above guidelines, so the title of the video is tagged and assigned deprels compositionally, with "Don't" as aux + advmod. However, the lemma of "Do" is retained capitalized as "Do" based on the same principle as United States, which the validator does not accept, since "Do" is not an auxiliary in English (but lowercase "do" is).

@dan-zeman What is the best way to resolve this? Am I misunderstanding any of the above English guidelines, or do they actually clash with the validator's auxiliary filtering policy? @nschneid is my overview of name lemmatization in English above correct or am I misremembering anything? Thanks for any suggestions!

dan-zeman commented 2 years ago

Technically it would not be a problem to allow uppercase in the AUX-registering system and it might be needed in the future (if English has an uppercase pronoun, and German has capitalized nouns, maybe there is a language that dictates to capitalize an auxiliary...)

But I don't want to have Do and do as two different auxiliaries in English.

And to be honest, I quite dislike the guidelines you outlined above. I agree that if Do is part of a movie title, it is AUX and not PROPN. But then I see absolutely no reason why its lemma should be different from the do that occurs in a normal sentence (such as I want to know how to avoid talking to people you don't want to talk to).

amir-zeldes commented 2 years ago

Thanks @dan-zeman - I think the issue is that then the lemma of United in United States would have to be lowercase "unite", which may seem jarring to English speakers. Recall that initially its lemma was "United", which clashed with using deprel amod, so when we agreed to do compositional analyses of names like that, we basically came up with 'capitalized lemmas' as the compromise, IIRC. For nouns it makes a lot of sense to me (Mona Lisa Smile/Smile), and for verbs it typically looks good (The movie "Sing/VERB/Sing"), but admittedly for function words it's less compelling.

I don't necessarily want to re-open that lengthy discussion, since I think it took us quite a while to reach the current consensus. I'm mainly looking for a way to satisfy the validator and stay within the guidelines...

nschneid commented 2 years ago

Should function words be an exception to this capitalization policy? Maybe the policy should be that content words in titles retain capitalized lemmas but the occasional capitalized AUX, DET, PART, etc. should be lowercase even in a title.

nschneid commented 2 years ago

@amir-zeldes in https://github.com/UniversalDependencies/UD_English-EWT/issues/131#issuecomment-819050887

publication title such as The American Conservative ... The and Conservative?

Forgot to say about "The" & co - the PTB guidelines tag function words in names normally, so the lemmatization practice derived from that has been to lemmatize them normally. For example:

The/DT/the Lord/NNP/Lord of/IN/of the/DT/the Rings/NNPS/Ring

This is what we do in GUM in any case - only the NNP-tagged words receive capitalized lemmas as relevant.
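As a rough illustration of the practice described above, here is a minimal Python sketch that reads `form/XPOS/lemma` triples like the example and flags capitalized lemmas on tokens not tagged NNP or NNPS. The function names are hypothetical and not part of any GUM or UD tooling, and the check deliberately ignores legitimate exceptions such as the pronoun "I".

```python
# Hypothetical helpers for illustration only; not part of GUM or the UD validator.
PROPER_XPOS = {"NNP", "NNPS"}  # PTB tags for proper nouns


def parse_triples(line):
    """Split whitespace-separated 'form/XPOS/lemma' tokens into tuples.

    rsplit is used so that a slash inside the word form does not break parsing.
    """
    return [tuple(tok.rsplit("/", 2)) for tok in line.split()]


def unexpected_capitalized_lemmas(triples):
    """Return triples whose lemma starts uppercase but whose XPOS is not NNP/NNPS."""
    return [
        (form, xpos, lemma)
        for form, xpos, lemma in triples
        if lemma[:1].isupper() and xpos not in PROPER_XPOS
    ]


example = "The/DT/the Lord/NNP/Lord of/IN/of the/DT/the Rings/NNPS/Ring"
triples = parse_triples(example)
flagged = unexpected_capitalized_lemmas(triples)
```

On the example line nothing is flagged, since only "Lord" and "Rings" carry capitalized lemmas and both are tagged as proper nouns.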

dan-zeman commented 2 years ago

@amir-zeldes not reopening that discussion sounds wise and relieving :-) I think I've participated in many lengthy discussions on this and similar topics over the years!

There is probably a scale. In "John Smith", I would say that Smith should be PROPN and its lemma capitalized. In a clause-like movie title that features a smith (occupation), I would argue that it is NOUN and its lemma is lowercased, despite the actual form in the sentence being capitalized. "United States" seems to be somewhere in the middle on this scale. (I would tend to be consistent and judge the word United the same way in UPOS, LEMMA and DEPREL... but let's keep that discussion closed if the current solution works.)

The smith-NOUN above is something I would recommend but it would force you to find a dividing line between that and United States, and it is not essential for the problem at hand. What @nschneid suggests about the function words should calm down the validator for the moment.

amir-zeldes commented 2 years ago

OK, that makes sense, so we'll say the function words category is lowercased, including verbs used as auxiliaries. (The reason this case originally had a capitalized lemma is that the policy was formulated based on xpos within GUM, and this is technically a verb, but we do know the deprel too, so it's easy to see this is AUX.)
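The policy agreed on here could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the set of function-word UPOS tags is limited to the three named in the thread (AUX, DET, PART), the `aux` deprel check stands in for "verbs used as auxiliaries", and `normalize_lemma` is a hypothetical name, not a function in GUM's build scripts or the validator.

```python
# Assumption: only the categories named explicitly in the thread are listed;
# the "etc." in the proposal is left open here.
FUNCTION_UPOS = {"AUX", "DET", "PART"}


def normalize_lemma(lemma, upos, deprel):
    """Lowercase the lemma of a function word, even inside a title.

    A token counts as a function word if its UPOS is in FUNCTION_UPOS or
    its deprel is aux (including subtypes such as aux:pass). Content words
    keep whatever capitalization their lemma already has.
    """
    base_deprel = deprel.split(":")[0]  # strip subtype, e.g. aux:pass -> aux
    if upos in FUNCTION_UPOS or base_deprel == "aux":
        return lemma.lower()
    return lemma
```

For instance, `normalize_lemma("Do", "AUX", "aux")` yields `"do"`, while a content word such as `normalize_lemma("Sing", "VERB", "root")` stays `"Sing"` (the `root` deprel is assumed for the sake of the example).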