UniversalDependencies / UD_English-PUD

Parallel Universal Dependencies.

question on tagging of PROPN #3

Open vcvpaiva opened 4 years ago

vcvpaiva commented 4 years ago

In the sentence below, shouldn't "Metropolitan Club" be tagged as PROPN?

# sent_id = n01003012
# text = The gathering was originally slated for Washington’s private Metropolitan Club on H Street a few blocks away.
1	The	the	DET	DT	Definite=Def|PronType=Art	2	det	2:det	_
2	gathering	gathering	NOUN	NN	Number=Sing	5	nsubj:pass	5:nsubj:pass	_
3	was	be	AUX	VBD	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin	5	aux:pass	5:aux:pass	_
4	originally	originally	ADV	RB	_	5	advmod	5:advmod	_
5	slated	slate	VERB	VBN	Tense=Past|VerbForm=Part	0	root	0:root	_
6	for	for	ADP	IN	_	11	case	11:case	_
7	Washington	Washington	PROPN	NNP	Number=Sing	11	nmod:poss	11:nmod:poss	SpaceAfter=No
8	’s	’s	PART	POS	_	7	case	7:case	OrigForm='s
9	private	private	ADJ	JJ	Degree=Pos	11	amod	11:amod	_
10	Metropolitan	metropolitan	ADJ	JJ	Degree=Pos	11	amod	11:amod	_
11	Club	club	NOUN	NN	Number=Sing	5	obl	5:obl:for	_
12	on	on	ADP	IN	_	14	case	14:case	_
13	H	h	PROPN	NN	Number=Sing	14	compound	14:compound	_
14	Street	street	PROPN	NN	Number=Sing	11	nmod	11:nmod:on	_
15	a	a	DET	DT	Definite=Ind|PronType=Art	17	det	17:det	_
16	few	few	ADJ	JJ	Degree=Pos	17	amod	17:amod	_
17	blocks	block	NOUN	NNS	Number=Plur	18	obl:npmod	18:obl:npmod	_
18	away	away	ADV	RB	_	11	advmod	11:advmod	SpaceAfter=No
19	.	.	PUNCT	.	_	5	punct	5:punct	_
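Cases like this one can be found mechanically. The following is only a rough sketch, not part of any UD tooling: it assumes whitespace-separated CoNLL-U rows (real files use tabs, which `split()` also handles), and the helper names `parse_tokens` and `capitalized_non_propn` are made up for illustration. It lists capitalized, non-sentence-initial tokens whose UPOS is not PROPN, using a few rows copied from the sentence above.

```python
# Sketch: flag capitalized, non-initial tokens that are not tagged PROPN.
# Column order follows CoNLL-U: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC.

SAMPLE = """\
9 private private ADJ JJ Degree=Pos 11 amod 11:amod _
10 Metropolitan metropolitan ADJ JJ Degree=Pos 11 amod 11:amod _
11 Club club NOUN NN Number=Sing 5 obl 5:obl:for _
12 on on ADP IN _ 14 case 14:case _
13 H h PROPN NN Number=Sing 14 compound 14:compound _
14 Street street PROPN NN Number=Sing 11 nmod 11:nmod:on _
"""


def parse_tokens(block):
    """Return (id, form, upos, deprel) tuples for plain token rows."""
    tokens = []
    for line in block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split()
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges, empty nodes, malformed rows
        tokens.append((int(cols[0]), cols[1], cols[3], cols[7]))
    return tokens


def capitalized_non_propn(tokens):
    """Keep tokens after position 1 that start with a capital but are not PROPN."""
    return [
        (tok_id, form, upos)
        for tok_id, form, upos, _deprel in tokens
        if tok_id > 1 and form[0].isupper() and upos != "PROPN"
    ]


flagged = capitalized_non_propn(parse_tokens(SAMPLE))
for tok_id, form, upos in flagged:
    print(tok_id, form, upos)
```

Capitalization is only a heuristic for English; it says nothing about whether the flagged tokens should be retagged, which is the question discussed in this thread.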

dan-zeman commented 4 years ago

No. PROPN is not the same thing as a named entity. The word club is a common noun. And metropolitan is not even a noun, it's an adjective.

nschneid commented 4 years ago

Following the Penn Treebank, it is typical for nouns and adjectives in named entities to be tagged as PROPN in English corpora. This may not be ideal from a UD perspective but it would be difficult to change. See discussions linked from UniversalDependencies/UD_English-EWT/issues/91

amir-zeldes commented 4 years ago

This is true, though the deprel for adjectival forms is still amod, so at least those modifiers are distinguishable from compound (non-adjectival modifier) and flat (no clear grammatical relation). For heads there is indeed no way to tell if something is a common noun heading a named entity (Dan's named University case) or a noun that is predominantly used as a name (e.g. Jane).

I have argued here that the distinction even between the above two cases is not really tenable outside of prototypical examples:

https://github.com/UniversalDependencies/docs/issues/678

I think if you want purely grammatical POS categories in English you can't rely on PROPN vs. NOUN anyway, see also the issue that Nathan linked to above. In terms of current English POS guidelines, AFAIK Metropolitan Club should be:

Metropolitan/NNP/PROPN/amod Club/NNP/PROPN/obl

vcvpaiva commented 4 years ago

Thank you for the pointers to the previous threads! I still think, with @amir-zeldes, that the Metropolitan Club is a proper noun in the same way that Cat in "I saw Cat" (where Cat is a nickname for someone or the play) is a proper noun. Another very bad example of treating such a name as a common noun is this sentence in PUD-EN:

# newdoc id = n01042
# sent_id = n01042004
# text_en = The Ontario Independent Police Review Director, Gerry McNeilly, set the terms for his review this week after "alarming questions" were raised about how officers interact with Indigenous peoples.
# text = O diretor de revisão independente da Polícia de Ontario, Gerry McNeilly, definiu os termos para a sua análise desta semana, após terem sido levantadas "questões alarmantes" sobre a forma como os oficiais interagiam com os povos indígenas.

where the mangled translation shows what may happen with the meaning if you don't say that the whole "Ontario Independent Police Review" is a proper noun.

dan-zeman commented 4 years ago

> if you don't say that the whole "Ontario Independent Police Review" is a proper noun

You really cannot say that in UD :-) You may pretend that each of the four words is a proper noun, which apparently is what Penn Treebank does. But the UD guidelines do not cover named entities, hence they give you no means to say that the whole is a named entity. (You can of course add such annotation in the MISC column, and some UD corpora do that. But that is beyond the scope of the UD guidelines.)

(BTW, the translators of the PUD corpus did not see the annotation, which was not ready yet, so it did not matter whether it would or would not be annotated PROPN. Unfortunately, it seems that they also did not see the whole document from which the sentence was taken, so many translations are problematic, not only into Portuguese but into other languages too.)

vcvpaiva commented 4 years ago

Thanks for the explanation! But on this issue, I think the guidelines are just wrong. I think it's very difficult to say that "United States" is not a proper noun but simply an adjective followed by a plural common noun, while BID (Business Information District) is a proper noun because annotators do not know what BID stands for. Shall I close the issue, then?

dan-zeman commented 4 years ago

The Penn Treebank convention aside, I would say that "business", "information" and "district" are common nouns regardless of capitalization, but "BID" is still a proper noun. Not because the annotator may not know how to expand it. But because it is one token, and it stands only for the named entity and nothing else.

nschneid commented 4 years ago

The problem is that UD doesn't have a fully satisfactory treatment of the syntax of multiword expressions—ideally we could represent that internally (at least historically, and with transparent semantics) "United States" is ADJ + plural NOUN, whereas as a phrase it functions as a singular PROPN. Some other kinds of treebanks annotate this in two layers.

dan-zeman commented 4 years ago

> UD doesn't have a fully satisfactory treatment of the syntax of multiword expressions

Yep, but that's the point. In the case of English "by and large", the individual words will be tagged ADP CCONJ ADJ, headed by the first word and connected by fixed relations. The information that the whole thing functions as an adverb can be deduced from the relation that attaches the whole expression to the verb (advmod), plus optionally the MISC column may contain MWEPOS=ADV. I would prefer to treat multi-word named entities the same way as other multi-word expressions.
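To make the mechanism concrete, here is a hedged sketch of how a consumer could recover the phrase-level category of a fixed expression. The three rows are an illustrative fragment written for this example (the head ID 5 and the XPOS values are assumptions, not taken from a treebank), and `parse_rows`, `parse_misc` and `fixed_expressions` are hypothetical helper names.

```python
# Sketch: read the phrase-level POS of a `fixed` expression from the
# attaching relation and, if present, from MWEPOS in the MISC column.

SAMPLE = """\
1 by by ADP IN _ 5 advmod _ MWEPOS=ADV
2 and and CCONJ CC _ 1 fixed _ _
3 large large ADJ JJ Degree=Pos 1 fixed _ _
"""


def parse_rows(block):
    """Parse whitespace-separated CoNLL-U token rows into dicts."""
    rows = []
    for line in block.splitlines():
        cols = line.split()
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip comments, ranges, empty nodes
        rows.append({"id": int(cols[0]), "form": cols[1], "upos": cols[3],
                     "head": int(cols[6]), "deprel": cols[7], "misc": cols[9]})
    return rows


def parse_misc(misc):
    """Turn 'A=1|B=2' into a dict; '_' means the column is empty."""
    if misc == "_":
        return {}
    return dict(item.split("=", 1) for item in misc.split("|") if "=" in item)


def fixed_expressions(rows):
    """Group each head with its `fixed` dependents, in surface order."""
    by_id = {r["id"]: r for r in rows}
    members = {}
    for r in rows:
        if r["deprel"] == "fixed":
            members.setdefault(r["head"], [by_id[r["head"]]]).append(r)
    result = []
    for head_id, toks in sorted(members.items()):
        toks.sort(key=lambda t: t["id"])
        head = by_id[head_id]
        result.append({
            "text": " ".join(t["form"] for t in toks),
            "word_tags": [t["upos"] for t in toks],
            "attached_as": head["deprel"],  # relation of the whole expression
            "mwepos": parse_misc(head["misc"]).get("MWEPOS"),  # optional
        })
    return result


expressions = fixed_expressions(parse_rows(SAMPLE))
print(expressions)
```

The word-level tags stay as they are; the phrase-level reading comes only from `attached_as` and the optional `mwepos` value.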

In general, UD lacks means to provide phrase-level annotation. That is not surprising, since UD is a dependency-based rather than phrase-based framework. Yet sometimes it would be useful. Possible addition of a mechanism for phrase-level features was discussed in 2016 during the preparation of the v2 guidelines but in the end it was abandoned because it seemed that the complexity would not be worth it. The problem is that simply adding a feature to the head word would not be sufficient: sometimes the information pertains only to a smaller phrase, not to the entire subtree of that word.

vcvpaiva commented 4 years ago

sorry but "by and large" is an idiom, an exception to how the language works, a corner case. "United States" is the vanilla way the language works, no exception, no corner case.

amir-zeldes commented 4 years ago

This doesn't change anything about the discussion, but actually by and large should be tagged ADV CCONJ ADV IMO - it's originally a nautical expression referring to two ways of setting a boat's sails. 'By' means pointing (nearly) into the wind, and 'large' means with the wind filling the sails from behind, so 'by' and 'large' are two manner adverbials, and together they took on the meaning 'either way (of sailing)' -> 'under most circumstances'.

vcvpaiva commented 4 years ago

happy to learn of the etymological origin of the idiom, but I take it that you don't disagree it's an idiom and that "by" is not used this way outside of nautical circles?

Also hoping that you don't intend your turkers to know the origins of all and any idioms in English?

amir-zeldes commented 4 years ago

Our own data at Georgetown is not produced by Turkers, but mostly comes from trained linguists working in a classroom setting... If they run into something like 'by and large' then they will very often ask what to do, and if not, such things are often caught in QA by the course TA or the instructor (i.e. me :)

But concretely as I wrote above, I didn't mean that this changes anything about the discussion, I just didn't want it on record that the correct tags here are ADP CCONJ ADJ, that's all. The fact that token-wise dependencies can't express properties of phrases is a given, and different language POS tagging guidelines all come to terms with this somehow. The PTB one is maybe not optimal, but no solution is perfect, and at least the PTB one is widely known to people working on English, which means fewer surprises/inconsistencies across datasets.

amir-zeldes commented 4 years ago

> Some other kinds of treebanks annotate this in two layers

@nschneid BTW at least for NPs, this is something that entity annotation somewhat covers in GUM (and UD Coptic!), since we have multi row bracketing structures in MISC expressing entities:

1   New New PROPN   NNP Number=Sing 3   nsubj   _   Discourse=preparation:1->4|Entity=(place-1
2   Zealand Zealand PROPN   NNP Number=Sing 1   flat    _   Entity=place-1)
3   begins  begin   VERB    VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   0   root    _   _
...

The values Entity=(place-1 and Entity=place-1) indicate the beginning and end of a multiword entity, for which the first word is now upos=PROPN but deprel=amod ("New"). In the winter we hope to roll out Wikification in GUM, after which you will also be able to get specific identifiers that can also tell you if something is a country, or a city, or the name of a play etc. using Wikidata's API - then it will look like this, with New_Zealand being the identifier of the corresponding Wikipedia page:

1   New New PROPN   NNP Number=Sing 3   nsubj   _   Discourse=preparation:1->4|Entity=(place-New_Zealand
2   Zealand Zealand PROPN   NNP Number=Sing 1   flat    _   Entity=place-New_Zealand)
3   begins  begin   VERB    VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   0   root    _   _
...
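As a rough illustration of how such brackets can be read back, the sketch below collects the tokens between an opening `Entity=(type-id` and the matching closing `Entity=type-id)`. It only handles the simple form shown in these examples (at most one bracket per token); the function name `entity_spans` is made up for this example and is not part of any GUM tooling.

```python
# Sketch: recover multiword entity spans from Entity=... values in MISC.
# Assumes one opening or closing bracket per token, as in the rows above.

SAMPLE = """\
1 New New PROPN NNP Number=Sing 3 nsubj _ Discourse=preparation:1->4|Entity=(place-New_Zealand
2 Zealand Zealand PROPN NNP Number=Sing 1 flat _ Entity=place-New_Zealand)
3 begins begin VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ _
"""


def entity_spans(block):
    """Return (type, identifier, text) for each bracketed entity span."""
    open_spans = {}  # label -> word forms collected so far
    spans = []
    for line in block.splitlines():
        cols = line.split()
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip comments, ranges, empty nodes
        form, misc = cols[1], cols[9]
        value = None
        for item in misc.split("|"):
            if item.startswith("Entity="):
                value = item[len("Entity="):]
        # Every span that is already open also covers this token.
        for forms in open_spans.values():
            forms.append(form)
        if value is None:
            continue
        label = value.strip("()")
        if value.startswith("("):
            open_spans[label] = [form]
        if value.endswith(")"):
            forms = open_spans.pop(label, None)
            if forms is not None:
                etype, _, ident = label.partition("-")
                spans.append((etype, ident, " ".join(forms)))
    return spans


spans = entity_spans(SAMPLE)
print(spans)
```

The same function works for the numeric form (`place-1`), where the identifier is just the running entity number.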