UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Toys R Us = Toy Be We? #1058

Closed amir-zeldes closed 6 days ago

amir-zeldes commented 1 month ago

I'm running into an issue lemmatizing "Toys R Us" in English. Here are the possibly conflicting guidelines:

What is the right thing to do here?

  1. Add an alternative English copula lemma spelling "Be" - in a way, if we are serious about the capital lemma guideline, this will be necessary for all sorts of names containing "Be" which have a transparent syntactic analysis.
  2. Use the lowercase lemma "be" whenever we have a transparent copula, even if it is part of a capitalized name
  3. Not treat "Toys R Us" (or any name containing a capitalized copula) as transparent, and go with flat - keep in mind that this would also affect very transparent cases, like the novel "I Am a Cat" by Natsume Souseki.

Thoughts?

dan-zeman commented 1 month ago

The problem seems to be that you want to have and not to have a transparent analysis at the same time. I think that one must select one of the following approaches and stick to it:

I think my favorite would be the transparent option, but definitely with lowercase "be" as the lemma of the copula. But I could accept the non-transparent approach, provided it is not mixed with the transparent one.

amir-zeldes commented 1 month ago

The problem is that this has already been discussed extensively for English, and the final decision was what I wrote above:

Again, this is not my preference or a proposal, this is what we settled on after the extensive discussion. So my question is only, given this framework, what's the right thing to do here? I think NOUN AUX PRON is not allowed because NOUN is ruled out by the above. But AUX PRON is still possible under the 'function words' exception, same as PTB xpos. However, lemma is meant to be "Be" based on those guidelines, so we need either a clear exception why it should be "be", or a clear exception why this shouldn't be cop, or an alternative lemma "Be" for the validator (not sure if there are other options I'm missing?)

dan-zeman commented 1 month ago

So my question is only, given this framework, what's the right thing to do here?

OK, then I'll leave it for the other maintainers of English to weigh in. Because I think this framework is wrong and therefore none of the things is right to do :-)

jnivre commented 1 month ago

I agree with Dan. It sounds to me like the language-specific discussion on English has converged on something that conflicts with my understanding of the universal guidelines, although some of these things have perhaps never been codified properly.

AngledLuffa commented 1 month ago

It does sound like a lot of the people instrumental in UD just said they don't like this particular scheme.

Also, I wanted to say hello to my future self, for when someone posts on Stanza's GitHub asking why "R" is being lemmatized to "Be".

amir-zeldes commented 1 month ago

Well, I think the discussion was spread over a bunch of issues in different repos, but this is a good starting point:

https://github.com/UniversalDependencies/docs/issues/777

And see some issues here and cross-references:

https://github.com/UniversalDependencies/UD_English-PUD/issues/3 https://github.com/UniversalDependencies/UD_English-EWT/issues/91

I also notice some posts about this from @dan-zeman (and one from @jnivre ), so I don't think this policy in English should be too surprising. I think the transparent syntax part is what @dan-zeman wanted, whereas the PROPN/lemma part goes more towards parity with the LDC corpora notion of "namedness", i.e. the one used in the context of NER.

nschneid commented 1 month ago

Based on notes in https://github.com/UniversalDependencies/UD_English-EWT/issues/131#issuecomment-787093974 I don't think we're 100% settled on lemma capitalization rules. For truly closed-class UPOS tags like AUX and PART we probably want to require lowercasing.

("Be" or "R" is a particularly thorny case because of multiple divergences between PTB and UD: the PTB rule is that all non-modal auxiliaries are verbs, and all verbs are content words, and all content words in a proper name are tagged NNP. We do not want to mess with PTB policies in XPOS. But the lemma capitalization policy in UD can take the UPOS into account.)
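A minimal sketch of what a UPOS-sensitive lemma casing rule could look like. The function name `normalize_lemma` and the constant `LOWERCASE_LEMMA_UPOS` are hypothetical, and the tag set {AUX, PART} is only the tentative proposal from this comment, not a codified guideline:

```python
# Hypothetical helper: lowercase lemmas only for closed-class UPOS tags.
# The tag set below is an assumption taken from the proposal in this thread.
LOWERCASE_LEMMA_UPOS = frozenset({"AUX", "PART"})


def normalize_lemma(lemma: str, upos: str) -> str:
    """Return the lemma lowercased if its UPOS is in the closed-class set."""
    if upos in LOWERCASE_LEMMA_UPOS:
        return lemma.lower()
    # All other tags (including PROPN) keep whatever casing they already have.
    return lemma


print(normalize_lemma("Be", "AUX"))      # be
print(normalize_lemma("Anne", "PROPN"))  # Anne
```

XPOS is deliberately not consulted here, so PTB-style tagging would be left untouched.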

Also: Technically the CorrectForm should be "ᴙ", right? :D

dan-zeman commented 1 month ago

the transparent syntax part is what @dan-zeman wanted, whereas the PROPN/lemma part goes more towards parity with the LDC corpora

Yep, without trying to verify what exactly I wrote in those threads, I believe this is accurate. I think I've also been consistently opposed to the LDC-related part (I hear the arguments for it, I'm just not willing to give them priority).

jnivre commented 1 month ago

I skimmed through the issues referenced, and it didn’t look like any consensus was reached.

For me, the main point here is that UD does not annotate named entities, which implies that the tag PROPN is reserved for words that are mainly (or only) used as names, which in English in turn implies not taking articles (except in meta-linguistic uses). All other named entity expressions should either be annotated as regular phrases, or using the flat relation if the internal structure is considered opaque (because of borrowings from other languages or just historical language development). If they are annotated as regular phrases, then they should not only have ordinary syntactic relations (as opposed to “flat”) but also ordinary (universal) postags, features and lemmas. I realize, however, that the latter point has probably never been made explicit in the guidelines. In particular, I see that the documentation of the “flat” relation, which explains that “The Lord of the Rings” should be annotated like “the king of Sweden”, doesn’t say anything about postags, features and lemmas.

nschneid commented 1 month ago

TBC, the original question in this issue was about a lemmatization issue that I think can be resolved narrowly, but the general question of the definition of PROPN has come up.

@jnivre and @dan-zeman's perspective is actually reflected in the universal PROPN docs page, which specifies "Cat/NOUN on a Hot Tin Roof". In principle, in the universal guidelines, it seems fine to say that some nouns are inherently proper and thus should be labeled PROPN, while others are common nouns that happen to be leveraged in a proper name, and should remain NOUN.

The problem is that English tagsets/corpora have no tradition of making this distinction. This is both a theoretical problem in that we would need guidelines for the borderline cases (e.g. a single-word named entity derived from a common noun, like "Creed"), and a practical problem of implementation (30K NNP|NNPS tokens in GUM+EWT alone, and the presence of an article is an insufficient test: e.g. "Georgetown University/NOUN", "a Toyota/PROPN"). If somebody wanted to tackle this for English, I think it would entail developing detailed guidelines and a lexicon, and ensuring the presence of entity type annotations for disambiguation ("Cat" the name vs. the animal) (only GUM has these entity types at present).

If they are annotated as regular phrases, then they should not only have ordinary syntactic relations (as opposed to “flat”) but also ordinary (universal) postags, features and lemmas.

This cannot be strictly true (that a PROPN never has dependents other than flat) because there are plenty of phrasal names that contain nested proper names, e.g. "Anne/PROPN of Green Gables".

jnivre commented 1 month ago

I did not mean to imply anything about what relations PROPN words can have. Of course many proper names are part of larger phrases, even phrases that are names (like the one you quote). All I said was that, in a transparent analysis, all words should have their ordinary postags, features and lemmas. And for "Anne", the ordinary postag is PROPN.

jnivre commented 1 month ago

It is an interesting question, however, whether a flat analysis implies that all component words should be tagged PROPN. I can imagine cases where some words are juxtaposed to form a name without being a syntactic phrase, and where some of the words are not proper names. I am not sure I can come up with a convincing example, though. :)

gossebouma commented 1 month ago

The Dutch treebanks use flat for analyzing multiword proper names, and normally label all parts as PROPN. So no attempt is made to annotate van (of) in Van Alebeek as an ADP (the same goes for determiners). There are interesting exceptions, though. In het Goede Vrijdag-akkoord (the Good Friday agreement), Vrijdag-akkoord is a flat dependent of Goede, yet it has UPOS=NOUN (as akkoord is a noun). Dutch spelling conventions are quite tricky here: compounds are normally written as a single word, but when the first part of the compound is a multiword proper name, the space is preserved. Another case is names with punctuation symbols, like Stop Aids Now!. The ! is seen as a separate token with UPOS=SYM, yet it forms part of the name and thus has the dependency label flat.

amir-zeldes commented 1 month ago

it didn’t look like any consensus was reached

It may have been in part in meetings, but it was definitely reached - I wouldn't have undertaken the project to consolidate lemma casing in GUM if it hadn't been. I am also not trying to reopen these questions - just to interpret the English guidelines with respect to the conflict above.

I think Nathan's proposal of lowercasing based on UPOS AUX/PART should work fine; I would just like that to be normative then.
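To illustrate what making this normative could mean in practice, here is a rough sketch of a check over CoNLL-U token lines. This is not the actual UD validator: the function name is hypothetical, the sample rows leave every column other than ID, FORM, LEMMA and UPOS unspecified, and the capitalized lemma "Be" is included only to show a token that such a check would flag.

```python
# Hypothetical check, not part of any existing UD tool.
CLOSED_CLASS_UPOS = {"AUX", "PART"}


def find_capitalized_closed_class_lemmas(conllu_text):
    """Return (id, form, lemma, upos) for AUX/PART tokens whose lemma is not lowercase."""
    problems = []
    for line in conllu_text.splitlines():
        # Skip blank sentence separators and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:  # CoNLL-U token lines have exactly 10 columns
            continue
        tok_id, form, lemma, upos = cols[0], cols[1], cols[2], cols[3]
        if upos in CLOSED_CLASS_UPOS and lemma != lemma.lower():
            problems.append((tok_id, form, lemma, upos))
    return problems


# Illustrative input only; unspecified columns are left as "_".
sample = "\n".join([
    "# text = Toys R Us",
    "1\tToys\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\tR\tBe\tAUX\t_\t_\t_\t_\t_\t_",
    "3\tUs\t_\t_\t_\t_\t_\t_\t_\t_",
])

print(find_capitalized_closed_class_lemmas(sample))
```

A check like this only reports offending tokens; whether they get fixed automatically or by hand would be a separate decision.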

the tag PROPN is reserved for words that are mainly (or only) used as names, which in English in turn implies not taking articles (except in meta-linguistic uses)

I'm not sure this is so straightforward for English, and I don't want to reopen the English discussion anyway, but if someone is thinking of applying this to other languages as a universal guideline I'd like to point out:

  1. Many languages don't have articles, and they are as diverse as Slavic and Japanese. Coming up with guidelines to explain what it means to mainly be used as a name there seems hard and likely to be inconsistent (I think the State Department is a name, and even if the article makes you say NOUN in English, I don't know how to argue one way or another for Japanese 外務省 "(the?) Foreign Ministry/Gaimusho").
  2. Some languages allow articles on stereotypical proper names in non-metalinguistic contexts (e.g. German "der Hans"), and many nouns habitually appear without them (esp. but not only mass nouns)
  3. In many languages, including English, some contexts neutralize article usage. For example in English compound modifiers, it's impossible to tell if something is article-compatible or not. Is "Wow Air" PROPN PROPN? Or PROPN NOUN because "Air" is a noun (but notice the whole phrase can be used without articles)? Or is it INTJ PROPN, because "wow" is an interjection? And what about names that are bare plurals?

I am also not saying the current situation is trivial in English, but I think cross-linguistically using something like article usage is a murky criterion, and many UD users probably expect PROPN to reflect something semantic like NER (and you can also check definiteness or articles using the FEATS and tree).

I'll go ahead and implement Nathan's solution - I'm leaving this open for a bit just because I don't want to shut discussion down of course.

jnivre commented 1 month ago

I was definitely not suggesting using article usage as a universal criterion. Every language has to be judged on its own internal criteria, and if a language does not have a grammaticalised distinction between common and proper nouns, it can simply use the NOUN tag for all nouns. In fact, the non-obligatoriness of the NOUN-PROPN distinction is my standard example when explaining that, while you cannot invent language-specific upos tags, you don't have to use all tags in all languages.

dan-zeman commented 1 month ago

many UD users probably expect PROPN to reflect something semantic like NER (and you can also check definiteness or articles using the FEATS and tree)

PROPN is definitely related to NER but it classifies one word, so it is not the same as NER when it comes to multiword entities. Czech is one of the languages where articles cannot be used as a criterion because they do not exist. We have a category called proper name in the grammar but it is semantic, it is used in rules for capitalization and it is not a part of speech because it can consist of multiple words. In fact, we were trying to convince people that UD should not have the PROPN category when UD v1 was discussed :-); but since the category is part of UD, we don't want to pretend it does not exist in Czech, because the users would expect it. The distinction is tricky and it is further complicated by the fact that our treebanks are conversions from non-UD annotation, so even if we can come up with acceptable annotation guidelines, we may not be able to enforce them in the data we have. I think the guidelines would be roughly as follows:

Of course there will be numerous cases where it is debatable which of the rules above applies. So far it was convenient to rely on the pre-UD tagging and avoid formulating more precise guidelines but with new treebanks being annotated natively in UD, we won't be able to escape it forever.

amir-zeldes commented 1 month ago

Every language has to be judged on its own internal criteria

Agreed - and I think for English what we have is pretty reasonable, and in any case, as Nathan pointed out, it's not really feasible to revise it too much (huge manual effort, and it's not clear that something different would actually be better).