UniversalDependencies / UD_Portuguese-PUD

Parallel Universal Dependencies.

Proper Nouns not considered proper #12

Closed by vcvpaiva 9 months ago

vcvpaiva commented 3 years ago

11 Estados NOUN NN Gender=Masc|Number=Plur 4 nmod Proper=True
12 Unidos ADJ JJ Gender=Masc|Number=Plur 11 amod Proper=True

"Estados Unidos" (United States) not considered a proper noun?

The stats file tells us that PUD-EN has 1727 PROPN occurrences, while PUD-PT has 1352. I would expect the same (or a similar) number of occurrences.

It is true that many words considered proper nouns in English are not considered so in Portuguese, for example Monday=segunda-feira.

But it's not clear to me how the annotation works in cases like "Universidade da Carolina do Norte". In PUD-EN, University, Carolina and North are all proper nouns, while in PUD-PT only Carolina is considered a proper noun.
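A rough way to reproduce the per-tag counts being compared here is to tally the UPOS column of the CoNLL-U file directly. The following is only a sketch: the function name `count_upos` and the inline sample are illustrative assumptions, the lemma column is left as `_` because the excerpt above does not show it, and the numbers in the stats file may be computed differently.

```python
from collections import Counter


def count_upos(conllu_text):
    """Count UPOS tags over the word lines of a CoNLL-U string."""
    counts = Counter()
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Keep only regular word lines: 10 tab-separated columns with an
        # integer ID. This skips comments, blank lines, multiword-token
        # ranges such as "3-4" and empty nodes such as "5.1".
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        counts[cols[3]] += 1  # column 4 (index 3) is UPOS
    return counts


# Illustrative two-token sample (assumption: lemma and DEPS are unknown, so "_").
sample = "\n".join([
    "# text = Estados Unidos",
    "1\tEstados\t_\tNOUN\tNN\tGender=Masc|Number=Plur\t0\troot\t_\tProper=True",
    "2\tUnidos\t_\tADJ\tJJ\tGender=Masc|Number=Plur\t1\tamod\t_\tProper=True",
    "",
])

upos_counts = count_upos(sample)
print(upos_counts["PROPN"], upos_counts["NOUN"], upos_counts["ADJ"])
```

To compare treebanks, the same function could be run on the contents of each `.conllu` file and the `PROPN` entries compared.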

arademaker commented 3 years ago

I don't know how this corpus was constructed, but I found that many cases are marked with Proper=True in the MISC column. @dan-zeman do you know the reason for that?
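For anyone who wants to list these cases, here is a minimal sketch that collects tokens carrying Proper=True in MISC whose UPOS is not PROPN. The helper name `proper_flagged_non_propn` and the sample sentence are assumptions made for illustration; the sample is not copied from the treebank.

```python
def proper_flagged_non_propn(conllu_text):
    """Return (form, upos) pairs flagged Proper=True in MISC but not tagged PROPN."""
    hits = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Regular word lines only: 10 columns and an integer ID.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        # MISC is a "|"-separated list of attributes, or "_" when empty.
        misc = cols[9].split("|") if cols[9] != "_" else []
        if "Proper=True" in misc and cols[3] != "PROPN":
            hits.append((cols[1], cols[3]))
    return hits


# Illustrative sample (assumption: unknown columns are filled with "_").
sample = "\n".join([
    "1\tEstados\t_\tNOUN\tNN\tGender=Masc|Number=Plur\t0\troot\t_\tProper=True",
    "2\tUnidos\t_\tADJ\tJJ\tGender=Masc|Number=Plur\t1\tamod\t_\tProper=True",
    "3\tCarolina\t_\tPROPN\t_\t_\t1\tnmod\t_\t_",
])

hits = proper_flagged_non_propn(sample)
print(hits)
```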

Anyway, I made a few changes. United States is easy, but what about University of North Carolina? In Bosque we previously adopted the convention of flat:name pointing to the first word on the left for all the words, including prepositions. Now I feel that prepositions should be attached with case, and content words attached to the first word on the left with flat:name.

dan-zeman commented 3 years ago

First of all, any UPOS tag (including PROPN) refers to just one word, not to a phrase. So it is possible to have a multi-word phrase which denotes a named entity, although none of its words is a proper noun. Unfortunately, even if you look only at a single word, opinions sometimes differ whether it is or is not a proper noun. The exact criteria will definitely differ across languages (Monday is a proper noun in English but it is a common noun in some other languages; whether the same can be said about North in English, I am not sure).

For me, even United States is not so easy because state is a common noun and united is not even a noun, it is an adjective (or verb – unite). I don't find the PROPN tag justified but annotators of English data use it in this case for both the words (and connect them via the compound relation). The guidelines for PROPN admit that this practice in English has been inherited from the Penn Treebank. On the other hand, the guidelines clearly say (for example here) that names with regular syntactic structure should use regular syntactic relations; while that statement does not explicitly talk about UPOS tags, the tags obviously have to match the syntactic relations somehow.

If United States is analyzed as two proper nouns in English, then perhaps the same analysis may apply to North Carolina but again, this is not something that should necessarily be mimicked in other languages. However, tagging University as PROPN is clearly wrong from my point of view (unfortunately that, too, occurs in UD English EWT).

For what it's worth, in Czech, Spojené státy "United States" is a common noun pre-modified by an adjective, connected via amod.

vcvpaiva commented 3 years ago

This is following the EN guidelines, so I am closing the issue.

arademaker commented 3 years ago

Sorry, I would like to keep it open to check the consistency between PT treebanks.

vcvpaiva commented 9 months ago

@arademaker I'm closing the issue, as you can open it in the other Portuguese treebanks; PUD is following the guidelines.

arademaker commented 9 months ago

Eventually we want a uniform approach among treebanks of the same language. But this is fine for now, since I am focusing on the GSD treebank for the next release.
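One possible starting point for such a consistency check is to collect, per word form, the set of UPOS tags used in each treebank and report the forms where the sets differ. This is only a sketch under stated assumptions: the function names are hypothetical, and both sample snippets are made up for illustration and do not claim to reflect the actual annotation in any treebank.

```python
from collections import defaultdict


def upos_by_form(conllu_text):
    """Map each word form to the set of UPOS tags it receives."""
    tags = defaultdict(set)
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Regular word lines only: 10 columns and an integer ID.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        tags[cols[1]].add(cols[3])
    return tags


def inconsistent_forms(text_a, text_b):
    """Forms present in both inputs whose UPOS tag sets differ."""
    a, b = upos_by_form(text_a), upos_by_form(text_b)
    return {form: (a[form], b[form]) for form in a.keys() & b.keys() if a[form] != b[form]}


# Made-up samples standing in for two treebanks of the same language.
treebank_a = "1\tEstados\t_\tNOUN\t_\t_\t0\troot\t_\t_\n2\tUnidos\t_\tADJ\t_\t_\t1\tamod\t_\t_"
treebank_b = "1\tEstados\t_\tPROPN\t_\t_\t0\troot\t_\t_\n2\tUnidos\t_\tADJ\t_\t_\t1\tamod\t_\t_"

diffs = inconsistent_forms(treebank_a, treebank_b)
print(diffs)
```

Comparing by surface form is a deliberate simplification; a real check would probably also need to account for casing and for genuinely ambiguous words.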