UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

English mischievous nominals involving names and numbers #1040

Open nschneid opened 4 months ago

nschneid commented 4 months ago

English has a tangled mess of minor patterns for constructing proper names. Having revised the flat guidelines to clarify the prototypical cases of headless vs. internally structured proper names, it is worth returning to "mischievous" cases that @amir-zeldes and I explored in this paper. Relatedly, @dan-zeman wrote a paper exploring how dates might be treated across several languages. For this thread, let's focus on constructions lacking evidence from agreement.

These constructions have been discussed in disparate threads, e.g. #455 and #654. I would like to see if looking at the range of constructions can lead us to some general principles for determining headedness and choosing a deprel.

This table from the paper offers a summary:

[Image: summary table from the paper]

Also: dates written like February 23, February 23rd, February the 23rd

Some starting points:

Concrete questions I have been stuck on:

  1. Morphosyntactically, is Lake Michigan more like Mirror Lake (maybe with inverted headedness) or like President Obama?
  2. To what extent should determiner licensing be a criterion for diagnosing the structure of proper names, given that names are commonly exempt from determiner rules of common nominals? If a common noun like lake is incorporated into a proper name, is it subject to the determiner-sensitive omissibility test (e.g. is *I went to lake. evidence that Lake is a modifier in Lake Michigan)?
  3. Both books have long Chapter 10s/?Chapters 10: does this reveal anything about headedness, or is the plural ending on 10s a phrasal clitic? Cf. Chapters 10 and 11.
  4. Formula One (racing) has a super opaque internal semantics and (to my knowledge) lacks determiners. Is this a good candidate for flat, and if so, what about other noun+number names?

dan-zeman commented 4 months ago

I have to make up my mind about the concrete questions you ask but a side note: The table from your paper mentions nummod as an option in several cases and I think that it is wrong (and it has been discussed and resolved already). nummod is for quantity, so it would be good in expressions like 10 chapters or one formula. But not in Chapter 10 or Formula One or Figure 4 or Firefox 58.0.

dan-zeman commented 4 months ago

To me, Mirror Lake looks very similar to standard nominal compounds in English (mirror case, mirror hall?) and the only difference seems to be that it does not use a determiner.

Lake Michigan feels different. I suspect that you only use this order if the second word cannot function as a common noun (or common anything) in English, right? Michigan is just a meaningless label from the English perspective, although I think that it actually means "big lake" in one of the Algonquian languages. In this light, I would say it is closer to President Obama, which also is not a compound (unlike e.g. U.S. president).

I am not super excited about looking at determiner licensing exactly for the reasons you cite. You would need a determiner with the non-name compound a/the mirror lake but you don't need it with Mirror Lake. On the other hand, I would not completely exclude it if we do not have any better clues.

As a side remark, in Czech Lake Michigan would be jezero Michigan, where jezero "lake" is not part of the name, and jezero should be the head because it would inflect for the case required by the surrounding context (nominative as subject, accusative as object etc.) while Michigan would stay in nominative no matter what. This would be different from prezident Obama where both words would inflect. I'm not claiming that it should affect the English solution in any way. Perhaps only if there were no good criteria in English and a bunch of other languages had criteria like this, someone might say that we decide it in English the same way for the sake of parallelism. But it would be the last thing I would consider.

martinpopel commented 4 months ago

in Czech Lake Michigan would be jezero Michigan

As a side note to this side remark, an even more common translation into Czech (according to several corpora) is Michiganské jezero. In this case, jezero is part of the name (named entity) because Michiganské is morphologically and syntactically an adjective. Both words inflect (e.g. genitive Michiganského jezera) and jezero is the head.

As a side comment regarding the side note:-), it is difficult to define which words are part of the name, even within a single language where capitalization usually helps. One could think that Michiganské as an adjective needs a governing noun, thus jezero must be part of the name as well. However, in náměstí Míru (square of Peace), náměstí is not considered part of the name and thus it is not capitalized, even though it is also the head (and Míru is a genitive noun). However, the subway station of the same name has both words capitalized: stanice metra Náměstí Míru, according to the official prescriptive grammar. Most Czech speakers never learn all the capitalization rules correctly.:-)

nschneid commented 4 months ago

Lake Michigan feels different. I suspect that you only use this order if the second word cannot function as a common noun (or common anything) in English, right?

Looking at this list as a sample, that appears to be the main usage (the 2nd part is inherently proper), but "Lake Pleasant" and "Lake Red Rock" do occur. Of the 5 Great Lakes, "Lake Superior" has a transparent second part (the others are derived from native words—Erie, Huron, Michigan, Ontario).

In terms of place names mentioning geographical features, "Mount" is also an initial word, and "Mount Pleasant" is a popular example where the 2nd part is transparent.

nschneid commented 4 months ago

Channeling @amir-zeldes (who is on vacation), the Lake X and Mount X patterns are presumably relics of older word orders in English or borrowings from French that are not productive beyond names.

nschneid commented 4 months ago

Let me see if I can articulate some principles based on the sentiments above. A straw man proposal to define nmod:desc:

nmod:desc is for a class of constructions where an unmarked, bare nominal premodifies the head of a nominal. It is a subtype of nmod and (in principle) a special case of nmod:unmarked. We use the term "descriptor word" for the modifying noun and "descriptor phrase" for a phrase that it heads, possibly with dependents of its own.

A descriptor phrase denotes a main category to which a referent belongs. The descriptor is a modifier—typically of a name or number, often forming a larger proper name with its head. The head is essential to denoting the correct referent, whereas the descriptor is omissible (at least with strong context to narrow down a set of possible referents). This distinguishes it from nominal compounds, where the main category noun is the head.

The descriptor phrase is not a full noun phrase: it does not begin with a preposition or determiner (or possessive taking the place of a determiner). appos is therefore not appropriate to link the descriptor with its head.

Given a nominal-nominal combination X Y, the following tests can be applied:

  1. If X is possessive, nmod:poss(Y, X) applies.
  2. If X is a number specifying the quantity of Y, nummod(Y, X) applies.
  3. If X is a full nominal, appos(X, Y) may be appropriate. E.g. my brother Sam, the River Thames
  4. If Y denotes the category of the referent, compound(Y, X) is likely appropriate. This includes cases where Y is a transparent category: Mississippi River, Mirror Lake
  5. If X is a prepositional phrase, compound(Y, X) is likely appropriate.
  6. If X lacks a transparent meaning that denotes the category of the referent, it is not a descriptor. This includes some noun+number constructions:
    • Formula One is a racing event, not a formula, so it is a headless construction: flat(Formula, One).
    • Firefox 58.0 can be said to be headed by the first word, which is a proper name not a category, so nmod:unmarked(Firefox, 58.0).
  7. dates - policy TBD
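
The tests above read as an ordered decision procedure, so here is a minimal Python sketch of them, purely for illustration. It is not an existing UD tool: the `Pair` class, its flag names, and `choose_deprel` are all hypothetical, and the flags are assumed to be supplied by a human annotator. Using `x_is_proper_name` to separate the two outcomes of test 6 is also an assumption, read off the Formula One and Firefox 58.0 examples. Dates return `None` because their policy is still TBD.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pair:
    """A nominal-nominal combination X Y with hand-annotated properties.

    All field names are hypothetical; each flag mirrors one numbered test.
    """
    x: str
    y: str
    x_is_possessive: bool = False     # test 1
    x_is_quantity_of_y: bool = False  # test 2
    x_is_full_nominal: bool = False   # test 3
    y_is_category: bool = False       # test 4
    x_is_pp: bool = False             # test 5
    x_is_category: bool = False       # test 6: X transparently names the category
    x_is_proper_name: bool = False    # assumption: splits the two outcomes of test 6
    is_date: bool = False             # test 7


def rel(name: str, head: str, dep: str) -> str:
    """Format a relation as name(head, dependent)."""
    return f"{name}({head}, {dep})"


def choose_deprel(p: Pair) -> Optional[str]:
    """Apply the tests in order; return None where no policy exists yet."""
    if p.is_date:
        return None  # dates: policy TBD, so make no decision
    if p.x_is_possessive:
        return rel("nmod:poss", p.y, p.x)
    if p.x_is_quantity_of_y:
        return rel("nummod", p.y, p.x)
    if p.x_is_full_nominal:
        return rel("appos", p.x, p.y)
    if p.y_is_category or p.x_is_pp:
        return rel("compound", p.y, p.x)
    if not p.x_is_category:
        # X is not a descriptor: headed by a proper-name X, otherwise headless
        if p.x_is_proper_name:
            return rel("nmod:unmarked", p.x, p.y)
        return rel("flat", p.x, p.y)
    # Not eliminated by any test: candidate for the proposed relation
    return rel("nmod:desc", p.y, p.x)


print(choose_deprel(Pair("Mirror", "Lake", y_is_category=True)))
print(choose_deprel(Pair("Lake", "Michigan", x_is_category=True)))
print(choose_deprel(Pair("Formula", "One")))
print(choose_deprel(Pair("Firefox", "58.0", x_is_proper_name=True)))
```

Under these assumptions Mirror Lake comes out as compound(Lake, Mirror), Lake Michigan as a candidate nmod:desc(Michigan, Lake), Formula One as flat(Formula, One), and Firefox 58.0 as nmod:unmarked(Firefox, 58.0). The hard part, deciding how to set each flag for a given expression, is exactly what the tests leave to linguistic judgment.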

Combinations not eliminated by the above constraints are good candidates for nmod:desc(Y, X). Typical morphosyntactic characteristics:

A consequence of the above guidelines is that each of the dependency relations at issue should have a fairly rigid directionality, resulting in the word order generalization:

(* nmod and nmod:unmarked would typically be post-head, but there are some pre-head usages. nmod examples include "at least/most" before a quantity, fronted dependents of a nominal e.g. the entity you are negotiating on behalf of, and some PP modifiers of nominal predicates or fragments. nmod:unmarked examples include "a couple" and extent modifiers of PPs.)

This seems appropriate as these pertain to English constructions where word order is paramount.

arademaker commented 4 months ago

Most Czech speakers never learn all the capitalization rules correctly.:-)

Same in Portuguese, and I guess in many other languages. We probably can't even talk about rules in general, only about rules for specific editors or conventions for specific contexts. So it may be hard to qualify a given usage as correct or incorrect.