UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

proper nouns - flat, compound, coordination #487

Closed claudiafreitas closed 6 years ago

claudiafreitas commented 6 years ago

During the annotation of the Portuguese material, we ran into the following issues:

  1. We are using flat whenever we have a PERSON or PLACE name, no matter whether it’s possible to recognize a trace of syntax/compositionality. For example: Nelson Sargento (lit. Nelson Sergeant), a musician, got his name because he was a sergeant. But only Wikipedia knows it… Rio Negro (Black River) got its name because of its dark water... no one remembers this. Proper names of persons and places tend to grammaticalize, and it is hard (if at all possible) to draw the line between grammaticalized cases and transparent ones. We decided to use flat in both cases. Do you agree?

  2. As for PROPNs which are ORGs, we tend to use compound, since they frequently have some kind of compositionality, although they are recognized as a single sense unit (which is attested by the presence of abbreviations): UPP – Unidade de Polícia Pacificadora (Peacekeeping Police Unit); CPI – Comissão Parlamentar de Inquérito (Parliamentary Inquiry Commission).

  3. As for titles such as “Lord of the Rings”: when we use regular syntax to annotate them, as in "Yesterday I saw Catch me if you can", should we have obj(saw, catch)? It seems weird to me.

  4. Titles/honorifics should be analyzed using flat. How should we deal with coordination in cases such as "Presidents Obama and Trump", since coordination highlights some syntax? UDPipe gave us the following analysis for the sentence "Here are the eating habits of Presidents Obama and Trump", which is fine with us:

Coordination (Obama; Trump); Compound (President; Obama)

If we agree with this analysis, then we can have different analyses of the "same" unit (flat vs. compound under coordination)... Is that OK?

dan-zeman commented 6 years ago

Ad 1: I tend to agree. Nelson Sargento could be flat even if it weren't a surname. As for Rio Negro, these lexicalizations may work differently in different languages (or even phrase by phrase), but if you feel that the original amod relation has disappeared, then I think it is OK to use flat.

Ad 2: I don't see why compound is better than following the UD guidelines and annotating the syntactic structure, i.e. nmod(Unidade, Polícia), case(Polícia, de), amod(Polícia, Pacificadora).

Ad 3: Catch me if you can is internally a clause but externally it substitutes for a noun phrase, so I think I would agree with obj. Otherwise, a clause filling an object slot would be ccomp, so that is definitely the other option here.

Ad 4: If singular President Obama is analyzed as flat(President, Obama), then coordination should work the same way, i.e. flat(Presidents, Obama); conj(Obama, Trump); cc(Trump, and).
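The analyses proposed in Ad 2 and Ad 4 can be written out as rows to make the attachments easier to compare. The following is only a minimal sketch: `to_rows` is a hypothetical helper, the rows keep just ID, FORM, HEAD and DEPREL rather than the full CoNLL-U column set, word forms are assumed to be unique within a fragment, and the first word of each fragment is marked as root purely for illustration (in a full sentence it would attach to something outside the fragment).

```python
def to_rows(tokens, root, triples):
    """Turn (relation, head, dependent) triples into (ID, FORM, HEAD, DEPREL) rows.

    Assumes every word form occurs only once in the fragment.
    """
    index = {form: i for i, form in enumerate(tokens, start=1)}
    attachment = {dep: (index[head], rel) for rel, head, dep in triples}
    attachment[root] = (0, "root")  # illustrative: fragment head treated as root
    return [(index[t], t, *attachment[t]) for t in tokens]


# Ad 2: regular syntactic structure for the organization name.
upp = to_rows(
    ["Unidade", "de", "Polícia", "Pacificadora"],
    "Unidade",
    [("nmod", "Unidade", "Polícia"),
     ("case", "Polícia", "de"),
     ("amod", "Polícia", "Pacificadora")],
)

# Ad 4: flat title plus coordination of the two names.
presidents = to_rows(
    ["Presidents", "Obama", "and", "Trump"],
    "Presidents",
    [("flat", "Presidents", "Obama"),
     ("conj", "Obama", "Trump"),
     ("cc", "Trump", "and")],
)

for row in upp + presidents:
    print(*row, sep="\t")
```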

arademaker commented 6 years ago

@dan-zeman, in case 4, we agree with you. But it is interesting to highlight that in the singular form, the head of the flat structure should be the word President (since with flat all "subsequent words in the expression are attached to the first one", http://universaldependencies.org/u/dep/flat.html). In the plural form, Obama is the head of the structure, at least for UDPipe trained with the ud-english model.

I also agree regarding compound: it is actually less informative to use compound. But in that case, it is even harder for me to find arguments to distinguish the use of compound from nmod.
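The head-first convention for flat quoted above lends itself to a mechanical check. This is a rough sketch, not an existing validator: `flat_violations` is a hypothetical name, and the rows are simplified (ID, FORM, HEAD, DEPREL) tuples.

```python
def flat_violations(rows):
    """Return (head_form, dependent_form) for every flat edge whose head
    comes after its dependent, i.e. where the first word is not the head."""
    form = {i: f for i, f, _, _ in rows}
    return [(form[h], f) for i, f, h, rel in rows if rel == "flat" and h > i]


# Head-first, as the flat guideline describes it.
head_first = [(1, "President", 0, "root"), (2, "Obama", 1, "flat")]

# Head-last variant, constructed here only to show what gets flagged.
head_last = [(1, "President", 2, "flat"), (2, "Obama", 0, "root")]

print(flat_violations(head_first))
print(flat_violations(head_last))
```

The check only looks at relations labelled flat, so an analysis that attaches the title with compound instead would pass silently.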

GPPassos commented 6 years ago

Regarding 1, there are names of people and places with internal ADPs (such as Chica da Silva), and a previous issue shows that there is no standard for this yet: https://github.com/UniversalDependencies/docs/issues/400

About the general issue, while I do understand the point that usually people won't remember the origin of the name, I see three issues:

a) it's pretty suggestive and sounds a little odd. Just tell a small child about a river called Rio Negro and it's very possible that she will be surprised and ask you whether it's actually black. For Nelson Sargento it isn't so obvious because it sounds like a (slightly funny) surname. If a guy named Nelson is actually a sergeant, we would usually say Sargento Nelson instead, so this is a case where it perhaps makes more sense to use flat.

b) I'm not really sure how far we can go in relying on what goes on in people's minds when they read or say something. Does having transparent syntax imply that people are always consciously aware of this syntax? If so, this raises the question of how empirical this should actually be and how to test it.

c) on a more pragmatic note, thinking about a pipeline framework, making things more opaque at the syntactic level means not having any transparent syntactic information later on (e.g. at the semantic level). In this way, it would be impossible to recover the blackness of Rio Negro.

On the other hand, having things more transparent means that perhaps there will be nothing in the syntactic annotation itself that says that certain things are named entities. For instance, in Chica da Silva, it perhaps seems useless to say that case(Chica,de) (regardless of whether Silva would be flat or nmod), as it's just a surname. Besides this, the example given in the other issue, "Von Hohenlohe gewann das Rennen.", is persuasive.

There's a trade-off between clearly available named entity information and fully transparent syntax annotated as thoroughly as it can be understood. It isn't obvious where the cut-off point should be, but I tend to think that it's preferable to err on the side of too much transparency (which would give syntactic information that could be post-processed or ignored), instead of risking throwing valuable information away.

A concrete example: in a sentence such as "Eu visitei o Rio Negro" (lit. I visited the Black River), from a transparent syntax we could only know that Rio Negro is a specific named entity from the capitalization. If not for that, it would be necessary to disambiguate between a specific river called "Rio Negro" (from background knowledge) and another river defined by context which could be referred to as "o rio negro". However, this seems to be exactly what happens in human communication, and there are possible ways of recovering the named entity information from the transparent syntax (with ambiguity on par with natural human ambiguity). But if Negro is classified as PROPN and flat, every clue about blackness is lost (except, of course, from the word form itself, but that would be the same as retagging it).
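One possible way of recovering named entity candidates from a transparent analysis, using only capitalization as described above, could look like the sketch below. Everything here is an assumption made for illustration: the function name `capitalized_spans`, the simplified (ID, FORM, HEAD, DEPREL) rows, the choice of nsubj, det and obj for the surrounding words, and the heuristic itself, which simply groups a capitalized head with its capitalized dependents and ignores the sentence-initial word as a possible head.

```python
def capitalized_spans(rows):
    """Group each capitalized head with its capitalized dependents.

    The sentence-initial token is skipped as a head because its capital
    letter carries no information about being a name.
    """
    spans = []
    for i, form, _, _ in rows:
        if i == 1 or not form[0].isupper():
            continue
        deps = [(j, f) for j, f, h, _ in rows if h == i and f[0].isupper()]
        if deps:
            members = sorted([(i, form)] + deps)  # restore surface order
            spans.append(" ".join(f for _, f in members))
    return spans


# Transparent analysis: amod(Rio, Negro), no flat, no special NE marking.
named = [
    (1, "Eu", 2, "nsubj"),
    (2, "visitei", 0, "root"),
    (3, "o", 4, "det"),
    (4, "Rio", 2, "obj"),
    (5, "Negro", 4, "amod"),
]

# Same structure, but a contextually defined river written in lowercase.
generic = [
    (1, "Eu", 2, "nsubj"),
    (2, "visitei", 0, "root"),
    (3, "o", 4, "det"),
    (4, "rio", 2, "obj"),
    (5, "negro", 4, "amod"),
]

print(capitalized_spans(named))
print(capitalized_spans(generic))
```

The amod relation stays available in both cases, so the information about blackness is not lost, while the capitalized variant is still picked out as a candidate name.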

sylvainkahane commented 6 years ago

For our (Spoken) French treebank we were very uncomfortable with flat, because flat is dedicated to headless constructions, but many constructions analyzed as flat in the guidelines are quite clearly headed in French. In French, unlike in English, N1 is always the head in an N1 N2 construction.

Consequently, we decided not to use flat and to use appos instead. More exactly, nmod:appos, because we also decided to split appos into two different relations, nmod:appos and conj:appos (but that is a different story which we could discuss in another thread).

The point here is that many named entities, including person names in many languages, have a regular syntactic structure. Rio Negro is 100% clear from the syntactic point of view: amod(Rio, Negro). What people try to capture by using another relation has nothing to do with syntax. If we want to annotate NEs in UD, we must say so explicitly and introduce clean tools for that.

dseddah commented 6 years ago

Hi Sylvain and all, I agree that those kinds of examples should also have named entity labels. In the case of the French QuestionBank[1] we had questions about movies (with and without quotes, with and without capitals) like

1) Who directed Who Framed Roger Rabbit?

with movie titles either in English or French, which causes trouble because foreign titles would have FW POS tags and a flat structure while regular French titles would get regular annotation, and frankly I don’t really see why (despite the general « let’s not annotate foreign parts » view).

There are also more complicated cases, as seen in blogs and such:

2) Qui a aimé le Comment je me suis disputé.. de Desplechin ? (Who liked the How I got into an argument.. of Desplechin?)

(Note: Comment je me suis disputé.. is the movie title; this could also work with « Everything you always wanted to know about sex and never dared to ask ».)

where, even though the title has a regular internal syntax and could be annotated as such, it behaves as a unit from the point of view of its matrix clause and seems much more complicated to annotate (what’s the head of that title to begin with? get? how?). If it were up to me, I’d gather all movie titles from IMDb as a « words with spaces » list and let someone interested in NER annotation deal with that later if they want, but I’m not sure some people would not spit out their coffee if they saw this half-serious proposal :)

Best, Djamé

[1] http://alpage.inria.fr/Treebanks/FQB/ (tl;dr: French QuestionBank, phrase-based structures, native surface dependencies and deep syntax graphs)

dan-zeman commented 6 years ago

We have stressed on many occasions that the UD guidelines and structural representation are not about named entities. If there is a UD-based corpus with named entities in the future, it will be an extension over UD (or a new version of the UD guidelines). Some datasets still violate this perspective and have non-nominal words in multi-word entities tagged as PROPN, but that is an error.

I would be absolutely fine with amod(Rio, Negro); it would even be the way I would annotate it. But I am not a speaker of Portuguese. And I have learned that, for instance, some speakers of French are strongly against treating surnames that have the form "de Location" as nmod(FirstName, Location); case(Location, de). The synchronic and diachronic analyses seem to diverge here; at least that was my understanding of how the native speakers feel about it. That's why I said that it probably needs a language-particular, or even phrase-particular, solution.