UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Use and documentation of `name` #324

Closed nschneid closed 7 years ago

nschneid commented 8 years ago
  1. Should "Microsoft Corporation", "Google Inc.", and similar use the name or compound relation? Results of this query mostly have compound, but at least one has name.
  2. Honorific titles before names like "Mr.", "Dr.", and "Rev." currently seem to have compound, though I'm not sure why name isn't preferred (as they operate in a sort of subgrammar specific to proper names).
  3. Job titles before names like "President Obama" also seem to have compound.

The name documentation page doesn't mention these cases.

dan-zeman commented 8 years ago

I would say that President Obama should be analyzed as nmod(Obama, President), and the same for "Mr." etc.

Note that this issue overlaps with #253 where a relevant discussion has occurred.

Along the same lines, I would do a non-name relation between Microsoft/Google and Corporation/Inc. (which also means I would not tag Corporation or Inc. as PROPN).

nschneid commented 8 years ago

Thanks, @dan-zeman. I don't have a strong opinion one way or another, but I would like to see a clearer definition of the boundaries of name. It says "Words joined by name should all be part of a minimal noun phrase"—is that to say that within a minimal name, there's no apparent modification structure? I.e., in personal names, neither forename nor surname is intuitively the head; whereas titles/designators like "President", "Mr.", "Inc." act as modifiers?

dan-zeman commented 8 years ago

That's how I feel it, although Mr being a modifier seems more a convention than obvious fact. It may depend on the focus. You could state that this Obama is a president, or vice versa, that this president's name is Obama. But rather than trying to guess the right interpretation in each sentence, I'd prefer a fixed rule here (at least within one language).

nschneid commented 8 years ago

An observation that may or may not be relevant: it's unusual to use "Mr." or "Dr." (not spelled out) anywhere but before a proper name. Whereas someone's full name can be abbreviated with either the forename or the surname. Similarly with "Inc." in the U.S./"Ltd" in the UK/"GmbH" in Germany (it can only go after the name of the company). Perhaps these abbreviations are parasitic on proper names in a way that full titles like "President" are not. Although maybe that's just an orthographic convention; we can spell them out, too, like "Doctor Zhivago".

Maybe refined relation labels would clarify nmod as the obvious for these—nmod:prename and nmod:postname??

amir-zeldes commented 8 years ago

I think I prefer name to nmod for Mr. and Inc., or even compound (though I think that term normally means something different for me). For some applications, particularly entity recognition and coreference, nmod implies a full NP (or dependency chain equivalent thereof), which is a candidate for being an entity or mention.

For example, if we have [the King of [Spain]], Spain is nmod, and in this case also referential. A label like name, by contrast, can be used to indicate that something is a complex name, which does not contain nested entities. For example, John Smith is not [[John] Smith] (the name label would indicate it's not a subtype of Smith), but [[Texas] Senate] (compound) is a subtype of Senate, and Texas is a potential mention of that state.

Comparing that with Mr., Inc., and President, I think in all 3 cases they are part of a complex named entity mention: [Google Inc.], [President Obama], (not an apposition with two mentions) and [Mr. Obama]. I realize this is rather focused on the entity mention application, but to me it makes sense as an argument against nmod or compound. I agree that something about all of these is 'parasitic', and I'd like to see them set apart from compounding and the typically prepositional and potentially referential nmods.

dan-zeman commented 8 years ago

@nschneid : What about Mr. President?

@amir-zeldes : While I am not completely sure what to do with Mr. (although I don't think that name is a suitable term to describe it), in general I am hesitant to formulate a rule just because it is good for named entity recognition. NER is a complex task that operates on a different level than our dependencies. We will never be able to solve it with the current representation alone, i.e. it will need a separate annotation anyway. For example, Czech Republic is clearly a named entity, but Czech is an adjective and thus it should be attached as amod(Republic, Czech).

If President is part of the “name” President Barack Obama, are other occupations too? Such as the “smith John Smith”?

nschneid commented 8 years ago

> @nschneid : What about Mr. President?

If the conclusion is that "Mr." is a modifier, I think that still works with "Mr. President". Spelled-out titles like "President", "Senator", and "Doctor" can occur on their own (or with an article). Perhaps one could even argue that "Mr." and similar are determiners, as it is unusual to combine them with an article.

amir-zeldes commented 8 years ago

@dan-zeman : I agree completely that NER can't be the deciding factor for a syntactic analysis, but I do think the distinctions involved are symptomatic of the differences between nmod and "Mr.". The ability to be a referring expression has been used as an argument in the analysis of compounds, so I thought it could be relevant here too. But let me give a couple of more purely morphosyntactic arguments from other languages:

The nmod label is generally used for adpositional modification, and, in some languages, for case marked nominals with no adposition (e.g. instrumental in Slavic languages). What these have in common is that they are case marked; they are not directly part of the NP head and can therefore take separate case. They can be chained recursively and take all sorts of cases: we can have an instrumental modified by a genitive, etc.

This is not the case for things like Mr. and President, because they are, in some sense, really just a part of the same core referring expression. I think the same applies to "President Barack Obama", which for me has the same syntactic type as "smith John Smith", "actor Leonardo DiCaprio", etc. In all of these cases, we see agreement, not government, and non-recursion. For example, in Polish, we have:

Prezydent Truman (President Truman, both nominative)
Pan Truman (Mr. Truman, same)

If we have multiple modifiers, I don't think they nest. For example in German, a Prof. Dr. is not a subtype of Dr., but someone who is both a Prof. and a Dr.:

Dr. Schmidt
Herr Schmidt (Mr. Schmidt, nominative)
Herrn Schmidt (same, but accusative, marked on Herr, but invisible on Schmidt)
Prof. Dr. Schmidt (two modifiers of Schmidt - and if case were marked it would have to agree throughout)

I also don't think these are compounds. In German, we would expect spelling together. But more importantly, titles like Mr. etc. do not agree with the word order of compounds in each language. For example, both German (head last) and Hebrew (head first) have Mr. as a prefix, and none of the compounding morphology you'd expect in either language:

DE:
Herr Schmidt (Mr. first)
Kranken-haus (sick-house = hospital, head last, definitely a compound)

HE:
Mar Schmidt (Mr. first)
Beit-xolim (house-sick = hospital, head first)

So these constructions appear to be distinct. In Modern Standard Arabic as well, you'd get the same word order as in Hebrew (Sayyid Schmidt), with genitive case linking in the compound (baitu l-waladi) but agreement in the Mr. case (As-Sayyidu Muhammad_u_n). Maybe name is not the optimal label here, but for me nmod is worse. If subtyping is an option, maybe name:title is a possible idea?

As for the Czech Republic, I think this also supports the claim that named entities typically don't contain referential categories like nmod, since amod is another category that does not generate an entity mention ("green chair" is a single entity mention, and "green" is not a candidate for a referring expression).

@nschneid : I guess it's possible to see them as det for English on some level, but in the Arabic example, the Mr. combines with the article regularly, and in German optionally (der Herr Schmidt), just like with any name (der Hans)...

sebschu commented 8 years ago

@amir-zeldes Yes, nmod is typically used for adpositional modifiers or case marked NPs, but there already exist some exceptions to this (e.g., we use nmod - or actually its sub-relation nmod:npmod - to attach a share in IBM earned $ 5 a share). Therefore I don’t think that using nmod would necessarily be problematic just because Mr. is neither an adpositional modifier nor case marked in other languages.

While I agree that nmod does not seem optimal here, I think it is still better than name, which in my opinion should only be used for actual parts of a name.

For the same reason, however, I think that name should be used for company names such as Google Inc. (and consequently Inc. should have the tag PROPN, which is consistent with the PTB guidelines).

@nschneid: As noted in #253, there are some inconsistencies in the English treebank regarding these constructions, which we should correct in one of the future releases. I also just updated the name docs, which still contained some incorrect information regarding names with foreign function words such as Ludwig van Beethoven.

amir-zeldes commented 8 years ago

@sebschu - not that this is necessarily a crucial argument here, but I think "a share" would probably be case marked in many languages, and it's substitutable for a PP 'per share'. It would be an oblique PP in Hebrew for example, and the share is, referentially, certainly not part of IBM, so that falls in line with what I was saying above.

dan-zeman commented 8 years ago

Maybe we actually do not need the name relation in UD. Maybe we need something that could be called just chunk. It would work similarly to name and it would cover the prototypical name-surname pairs, but its scope would be broader and it would be defined morphosyntactically rather than semantically, in line with what @amir-zeldes notes above. It would be the minimal noun phrase where no internal head-dependent relations can be easily established. Each language would have to specify separately what are the properties of such chunks. But in general one would not expect any intervening adpositions, particles and other function words, and if the nouns are case marked, then it should be because of factors external to the chunk.

So the chunk relation would cover the whole of current name relation, plus certain specific cases of the current nmod relation. We would also have to demarcate the border between chunk and appos (currently between nmod and appos; issue #253 also has a discussion on that). Perhaps the only distinctive factor would be an intervening punctuation symbol (=> appos).

There would still be interesting distinctions left for the parser to decide. For example, in the Czech do auta prezidenta Obamy (“to the car of President Obama”), all three nouns are morphologically genitive, but for auta it is caused by the valency of the preposition do, and prezidenta Obamy is in genitive as an equivalent of the English preposition of. Both prezidenta and Obamy receive the genitive from outside; the analysis would be nmod(auta, prezidenta) but chunk(prezidenta, Obamy). On the other hand, if the phrase is do auta prezidenta Botswany (“to the car of the President of Botswana”), we have again three nouns in the genitive but there is no chunk: nmod(auta, prezidenta); nmod(prezidenta, Botswany).
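
The two Czech analyses can be written out as data to make the contrast concrete. This is only a sketch: `chunk` is the relation proposed in this comment, not an existing UD label, and the triple layout and the helper `chunk_members` are hypothetical names introduced for illustration.

```python
# Each analysis is a list of (relation, head, dependent) triples,
# mirroring the relation(head, dependent) notation used in the comment.
obama = [
    ("nmod", "auta", "prezidenta"),
    ("chunk", "prezidenta", "Obamy"),     # proposed relation, not part of UD
]
botswana = [
    ("nmod", "auta", "prezidenta"),
    ("nmod", "prezidenta", "Botswany"),   # ordinary nominal modifier, no chunk
]


def chunk_members(triples, head):
    """Return `head` plus every word reachable from it through chunk edges only."""
    members = [head]
    changed = True
    while changed:
        changed = False
        for rel, h, dep in triples:
            if rel == "chunk" and h in members and dep not in members:
                members.append(dep)
                changed = True
    return members
```

Here `chunk_members(obama, "prezidenta")` groups prezidenta and Obamy into one unit, while the same call on `botswana` returns prezidenta alone, since Botswany is attached by nmod.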

nschneid commented 8 years ago

Another distinction that could be made for names is to distinguish between the parts of the name that are completely arbitrary (up to whoever performed an act of naming, without any semantic relationship to the entity in question) vs. parts which have transparent semantics.

As a point of reference, AMR includes a type and a name for each named entity. The name field includes simple honorifics like "Mr." and suffixes like "Jr.", in which case person is used for the type. But information indicating an occupation or role ("Dr.", "President", "chairman") is used for a finer-grained type (or compositional semantic structure) instead of being part of the name field. With proper names ending in "Inc.", "Corp.", "Corporation", etc., that word is included in the name field, and company is used for the entity type. Similarly, "Democratic Party" is preserved as such in the name field, with political-party as the entity type. Note that AMR has its own ontology of entity types to be used unless a more specific term is given in the sentence.

amir-zeldes commented 8 years ago

This has been really interesting to discuss, thanks! @dan-zeman I see what you're saying about the use of name so far - initially, when I was trying to write a conversion script from Stanford Typed Dependencies to UD, this was the category that had me stumped, because you need semantic knowledge to distinguish it from the more compound-like nn cases. I don't know if the chunk terminology will be able to catch on, since name has been in use for a while, but I agree with what you said above about the case criterion. Chunk could be fine, or I think at least subtyping one of the existing relations (name?) with something like 'title' or 'pre/postname' or something of that sort could be useful. This all reminds me a little of the German Tiger Treebank's use of the "NK" label for 'NP kernel', which was assigned to agreeing non-heads in the NP constituent, like determiners and adjectives. Though I think the motivation there may have been to avoid the DP/NP controversy.

I think your idea of the definition of nmod as being 'non-agreeing' (or let's just say it can have the same case, but not because of agreement) is an interesting one, because it is purely morphosyntactic. I don't think you'll be able to delimit apposition using commas though, not just because that might not generalize across languages, but also because even in English I don't think the comma is always a necessary condition. I think somehow it has to do with full NP status. Compare the following:

[Mary], [my sister], always says... (definitely appos)
[My sister] [Mary] always says... (in my opinion also appos, see below)
Sister Mary says (in reference to a nun, not appos)
*Mary sister (in reference to a nun - not movable since part of same NP)

Also with 'president':
[President Obama]
*[Obama President]
[The President], [Obama]
[Obama], [the President]

The movement test suggests that "My sister Mary" is two full NPs, much like the latter (definite) 'president' case, probably related to the fact that 'sister' gets its own determiner 'my'. I think jobs and honorific titles in English are much more bleached and less independent than other nouns (parasitic as @nschneid put it) and in some sense don't form their own referential NPs, unlike nmod. So maybe chunk is a good term for this, but I'm not sure if others will prefer a new major label or just a subtype for these, and I'm not sure if people will want to drop name (I initially didn't like it at all but then just got used to it!)

nschneid commented 8 years ago

Yeah, name:pre for "Mr.", "President", etc. used at the beginning of the name and name:post for "Inc." and similar might work. If these are well-defined phenomena, it could be better to give them names rather than force annotators/users to remember whether they're nmod, amod, compound, or name.

amir-zeldes commented 8 years ago

That makes sense to me. Just one question though: when you say name:pre, do you envision linking backwards from the first token of the 'actual' name? Or would we stick to 'left to right', and just label the first deprel name:pre?

nschneid commented 8 years ago

I was thinking of linking backwards. So "Rev. Dr. Martin Luther King , Jr." would be

name:pre(Martin, Rev.)
name:pre(Martin, Dr.)
name(Martin, Luther)
name(Martin, King)
name:post(Martin, Jr.)
punct(Jr., ",")

"Secretary of State Hillary Clinton" would be

name:pre(Hillary, Secretary)
nmod(Secretary, State)
case(State, of)
name(Hillary, Clinton)

"Mr. and Mrs. John Smith , Esq." would be

name:pre(John, Mr.)
cc(Mr., and)
conj(Mr., Mrs.)
name(John, Smith)
name:post(John, Esq.)
punct(Esq., ",")
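
For readers who want to experiment with these analyses, the first tree above can be held as plain triples and queried by relation. This is a minimal sketch; `name:pre` and `name:post` are the subtypes proposed in this thread, and the helper `name_parts` is a hypothetical name, not part of any UD tool.

```python
from collections import defaultdict

# "Rev. Dr. Martin Luther King , Jr." as (relation, head, dependent) triples,
# copied from the analysis above.
mlk = [
    ("name:pre", "Martin", "Rev."),
    ("name:pre", "Martin", "Dr."),
    ("name", "Martin", "Luther"),
    ("name", "Martin", "King"),
    ("name:post", "Martin", "Jr."),
    ("punct", "Jr.", ","),
]


def name_parts(triples, head):
    """Group the dependents of `head` by name relation, ignoring other relations."""
    parts = defaultdict(list)
    for rel, h, dep in triples:
        if h == head and (rel == "name" or rel.startswith("name:")):
            parts[rel].append(dep)
    return dict(parts)
```

With this layout, `name_parts(mlk, "Martin")` separates the titles, the core name parts, and the suffix without any lexical lookup, which is the practical effect of giving each position its own label.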

sebschu commented 8 years ago

The structure of all these trees makes sense to me and there might be a case for name:pre so that we don't have to use nmod, but I really don't see the benefit of name:post over simply using name. @nschneid and @amir-zeldes: Do you have any examples or tasks in mind where this distinction would be important?

nschneid commented 8 years ago

I agree that name:post is probably less important than name:pre, at least for English. But there are a few words like "Jr." and "Inc." that have a restricted distribution, attaching at the end of names.

amir-zeldes commented 8 years ago

I'm not sure if it's a relevant argument, but in many languages, the elements that do the same work as "Mr." are postposed, e.g. Chinese or Japanese (-xiansheng, -san etc.).

The main advantage of marking pre and post as separate from regular name, at least for me, is being able to determine the lexical head for tasks like NER and coreference resolution. If something is "IBM Inc", then it's a mention of the head "IBM", not "Inc" (in fact, I regularly rewire trees exactly as the 'post' guideline suggests before reading them into a coreferencer I'm working on). But I freely acknowledge that it doesn't necessarily have to be the job of the parser to do this just because of NER - it would just be nice if they could help out!
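
A rough sketch of that kind of rewiring, under stated assumptions: this is not the script mentioned above, the two lexicons are hypothetical toy lists, and the function `rewire` simply takes the tokens of one name span and picks the first non-title token as lexical head.

```python
# Hypothetical lexicons, for illustration only; a real system would need
# language-specific lists or a tagger.
PRE_TITLES = {"Mr.", "Mrs.", "Dr.", "Rev.", "President"}
POST_DESIGNATORS = {"Jr.", "Esq.", "Inc", "Inc."}


def rewire(tokens):
    """Turn the tokens of one name span into (relation, head, dependent) triples.

    The head is the first token that is neither a title nor a designator.
    """
    head_idx = next(
        (i for i, tok in enumerate(tokens)
         if tok not in PRE_TITLES and tok not in POST_DESIGNATORS),
        None,
    )
    if head_idx is None:
        # e.g. "Mr. President": no core name to attach to, so propose nothing.
        return []
    head = tokens[head_idx]
    triples = []
    for i, tok in enumerate(tokens):
        if i == head_idx:
            continue
        if i < head_idx and tok in PRE_TITLES:
            rel = "name:pre"
        elif i > head_idx and tok in POST_DESIGNATORS:
            rel = "name:post"
        else:
            rel = "name"
        triples.append((rel, head, tok))
    return triples
```

For example, `rewire(["IBM", "Inc"])` yields a single `name:post` edge headed by IBM, and `rewire(["Mr.", "Obama"])` yields a `name:pre` edge headed by Obama, so a coreference system reading the result sees the intended lexical head in both cases.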

spyysalo commented 7 years ago

Closing due to lack of recent activity and name being removed in v2 (replaced with flat with option to use flat:name). Please consider opening a new issue with reference to the new guidelines and this discussion if there are open questions relating to this issue.
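
For anyone carrying v1-style labels forward, the relabeling described in this closing note could be sketched as follows. The function name `to_v2` and its `use_subtype` flag are hypothetical, and the sketch assumes only the label changes; the `name:pre` and `name:post` subtypes proposed in this thread were never UD relations, so they are deliberately left untouched rather than guessed at.

```python
def to_v2(deprel, use_subtype=False):
    """Map the v1 relation `name` to v2 `flat` (or `flat:name` if requested).

    Every other relation label is returned unchanged.
    """
    if deprel == "name":
        return "flat:name" if use_subtype else "flat"
    return deprel


# Example: relabel one (relation, head, dependent) triple.
rel, head, dep = ("name", "Martin", "Luther")
converted = (to_v2(rel, use_subtype=True), head, dep)
```

Whether head attachment also needs adjusting depends on the treebank and the v2 guidelines for `flat`, which this sketch does not attempt to handle.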