English mischievous nominals involving names and numbers

English has a tangled mess of minor patterns for constructing proper names. Having revised the flat guidelines to clarify the prototypical cases of headless vs. internally structured proper names, it is worth returning to "mischievous" cases that @amir-zeldes and I explored in this paper. Relatedly, @dan-zeman wrote a paper exploring how dates might be treated across several languages. For this thread, let's focus on constructions lacking evidence from agreement.

These constructions have been discussed in disparate threads, e.g. #455 and #654. I would like to see if looking at the range of constructions can lead us to some general principles for determining headedness and choosing a deprel.

This table from the paper offers a summary:

Also: dates written like February 23, February 23rd, February the 23rd

Some starting points:

appos requires two full nominals, which would seemingly exclude most of these constructions (except my brother Sam)
actor Ulliel and President Obama seem like premodification constructions as the first part can be omitted, but unlike compound, the first part is plural if the second part is coordinated. This seems to be a distinct type of nominal modification construction, which we call nmod:desc ("descriptor").
Some proper names license determiners and behave like right-headed compounds: the Kashmir Valley. So compound clearly works there.
The table suggests a head and possible relations for the other cases. But these are subject to debate, as typical headedness criteria like omissibility and modifiability may or may not be helpful within proper names, whose elements generally cohere tightly. For expressions like Lake Michigan, there is tension between word order (English compounds strongly tend to be right-headed, modulo a couple of clear exceptions like attorney general) and semantics (ordinarily the more general category is the head).

Concrete questions I have been stuck on:

Morphosyntactically, is Lake Michigan more like Mirror Lake (maybe with inverted headedness) or like President Obama?
To what extent should determiner licensing be a criterion for diagnosing the structure of proper names, given that names are commonly exempt from determiner rules of common nominals? If a common noun like lake is incorporated into a proper name, is it subject to the determiner-sensitive omissibility test (e.g. is *I went to lake. evidence that Lake is a modifier in Lake Michigan)?
Both books have long Chapter 10s/?Chapters 10: does this reveal anything about headedness, or is the plural ending on 10s a phrasal clitic? Cf. Chapters 10 and 11.
Formula One (racing) has super opaque internal semantics and (to my knowledge) lacks determiners. Is this a good candidate for flat, and if so, what about other noun+number names?

I still have to make up my mind about the concrete questions you ask, but a side note: the table from your paper mentions nummod as an option in several cases, and I think that is wrong (it has been discussed and resolved already). nummod is for quantity, so it would be good in expressions like 10 chapters or one formula, but not in Chapter 10 or Formula One or Figure 4 or Firefox 58.0.

To me, Mirror Lake looks very similar to standard nominal compounds in English (mirror case, mirror hall?) and the only difference seems to be that it does not use a determiner.

Lake Michigan feels different. I suspect that you only use this order if the second word cannot function as a common noun (or common anything) in English, right? Michigan is just a meaningless label from the English perspective, although I think that it actually means "big lake" in one of the Algonquian languages. In this light, I would say it is closer to President Obama, which also is not a compound (unlike e.g. U.S. president).

I am not super excited about looking at determiner licensing, for exactly the reasons you cite. You would need a determiner with the non-name compound a/the mirror lake but you don't need it with Mirror Lake. On the other hand, I would not completely exclude it if we do not have any better clues.

As a side remark, in Czech Lake Michigan would be jezero Michigan, where jezero "lake" is not part of the name, and jezero should be the head because it would inflect for the case required by the surrounding context (nominative as subject, accusative as object etc.) while Michigan would stay in nominative no matter what. This would be different from prezident Obama where both words would inflect. I'm not claiming that it should affect the English solution in any way. Perhaps only if there were no good criteria in English and a bunch of other languages had criteria like this, someone might say that we decide it in English the same way for the sake of parallelism. But it would be the last thing I would consider.

> in Czech Lake Michigan would be jezero Michigan

As a side note to this side remark, an even more common translation into Czech (according to several corpora) is Michiganské jezero. In this case, jezero is part of the name (named entity) because Michiganské is morphologically and syntactically an adjective. Both words inflect (e.g. genitive Michiganského jezera) and jezero is the head.

As a side comment regarding the side note :-), it is difficult to define which words are part of the name, even within a single language where capitalization usually helps. One could think that Michiganské, as an adjective, needs a governing noun, and thus jezero must be part of the name as well. However, in náměstí Míru (square of Peace), náměstí is not considered part of the name and thus is not capitalized, even though it is also the head (and Míru is a genitive noun). Yet the subway station of the same name has both words capitalized: stanice metra Náměstí Míru, according to the official prescriptive grammar. Most Czech speakers never learn all the capitalization rules correctly. :-)

> Lake Michigan feels different. I suspect that you only use this order if the second word cannot function as a common noun (or common anything) in English, right?

Looking at this list as a sample, that appears to be the main usage (the 2nd part is inherently proper), but "Lake Pleasant" and "Lake Red Rock" do occur. Of the 5 Great Lakes, "Lake Superior" has a transparent second part (the others are derived from native words—Erie, Huron, Michigan, Ontario).

In terms of place names mentioning geographical features, "Mount" is also an initial word, and "Mount Pleasant" is a popular example where the 2nd part is transparent.

Channeling @amir-zeldes (who is on vacation), the Lake X and Mount X patterns are presumably relics of older word orders in English or borrowings from French that are not productive beyond names.

Let me see if I can articulate some principles based on the sentiments above. A straw man proposal to define nmod:desc:

nmod:desc is for a class of constructions where an unmarked, bare nominal premodifies the head of a nominal. It is a subtype of nmod and (in principle) a special case of nmod:unmarked. We use the term "descriptor word" for the modifying noun and "descriptor phrase" for a phrase that it heads, possibly with dependents of its own.

A descriptor phrase denotes a main category to which a referent belongs. The descriptor is a modifier—typically of a name or number, often forming a larger proper name with its head. The head is essential to denoting the correct referent, whereas the descriptor is omissible (at least with strong context to narrow down a set of possible referents). This distinguishes it from nominal compounds, where the main category noun is the head.

The descriptor phrase is not a full noun phrase: it does not begin with a preposition or determiner (or possessive taking the place of a determiner). appos is therefore not appropriate to link the descriptor with its head.

Given a nominal-nominal combination X Y, the following tests can be applied:

If X is possessive, nmod:poss(Y, X) applies.
If X is a number specifying the quantity of Y, nummod(Y, X) applies.
If X is a full nominal, appos(X, Y) may be appropriate. E.g. my brother Sam, the River Thames
If Y denotes the category of the referent, compound(Y, X) is likely appropriate. This includes cases where Y is a transparent category: Mississippi River, Mirror Lake
If X is a prepositional phrase, compound(Y, X) is likely appropriate.
If X lacks a transparent meaning that denotes the category of the referent, it is not a descriptor. This includes some noun+number constructions:
- Formula One is a racing event, not a formula, so it is a headless construction: flat(Formula, One).
- Firefox 58.0 can be said to be headed by the first word, which is a proper name not a category, so nmod:unmarked(Firefox, 58.0).
dates - policy TBD
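
To make the ordering of these tests easier to discuss, here is a minimal Python sketch of the cascade. It is only an illustration of the straw man, not an agreed procedure: the `Pair` class, its boolean feature names, and `suggest_relation` are all hypothetical, and the features are assumed to be supplied by a human annotator. The split between flat and nmod:unmarked for non-descriptor X is generalized from the two examples above (Formula One, Firefox 58.0) and is an assumption. Dates are left out because their policy is TBD.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    """Hypothetical hand-annotated features of a nominal-nominal combination X Y."""
    x: str
    y: str
    x_is_possessive: bool = False
    x_is_quantity: bool = False       # X is a number giving the quantity of Y
    x_is_full_nominal: bool = False   # X has a determiner, possessive, etc.
    y_denotes_category: bool = False  # Y names the category of the referent
    x_is_pp: bool = False             # X is a prepositional phrase
    x_denotes_category: bool = True   # X transparently names the category
    x_is_proper_name: bool = False    # assumption: only used if X is no category


def suggest_relation(p: Pair) -> str:
    """Apply the tests in the order listed and return a suggested relation."""
    if p.x_is_possessive:
        return f"nmod:poss({p.y}, {p.x})"
    if p.x_is_quantity:
        return f"nummod({p.y}, {p.x})"
    if p.x_is_full_nominal:
        return f"appos({p.x}, {p.y})"
    if p.y_denotes_category or p.x_is_pp:
        return f"compound({p.y}, {p.x})"
    if not p.x_denotes_category:
        # X is not a descriptor: headed by a proper name, or headless.
        if p.x_is_proper_name:
            return f"nmod:unmarked({p.x}, {p.y})"
        return f"flat({p.x}, {p.y})"
    # Nothing eliminated the pair, so it is a candidate descriptor construction.
    return f"nmod:desc({p.y}, {p.x})"


print(suggest_relation(Pair("President", "Obama")))
print(suggest_relation(Pair("Formula", "One", x_denotes_category=False)))
```

With the defaults, a pair like Lake Michigan falls through every test and comes out as a candidate for nmod:desc(Michigan, Lake), which is the outcome the proposal intends.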

Combinations not eliminated by the above constraints are good candidates for nmod:desc(Y, X). Typical morphosyntactic characteristics:

Y is often an inherently proper name (with no content semantics, only referential meaning) or a number used as an identifier (rather than as a quantity).
To express multiple referents of a similar type but with different names, a plural X may distribute across coordinated Y: actors Sheen and Janney, Lakes Michigan and Erie, Sections 1 and 2
- While the ability to pluralize X is characteristic of many of these patterns, it does not hold 100% of the time: ?Mounts Everest and Kilimanjaro; Read Chapter 1 and 2 is well-attested along with Read Chapters 1 and 2
Multiple referents with exactly the same name would usually result in a phrase-level plural (the two books' Section 1s). This should not be taken as evidence that Y is the head of X, but as evidence that X and Y form a multiword expression.
Y is generally not omissible without adding a determiner.
Omitting X makes the nominal less detailed and formal, but still plausibly grammatical. Sometimes, Y is sufficiently specific that X is omissible in most contexts (actors Sheen and Janney → Sheen and Janney; President Obama → Obama). In other cases, X is a conventional part of the name such that omitting it requires strong context (plain Michigan would normally refer to the state, but in the right context it could be an abbreviated reference to the lake: Which lake is nicer, Erie or Michigan?).
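
As a rough illustration of the plural-descriptor pattern, the sketch below builds the dependency triples that the straw man would imply for a phrase like Lakes Michigan and Erie. The function name `descriptor_coordination` is hypothetical, and the treatment of coordination (first conjunct as head, cc attached to the final conjunct) is assumed from the usual UD convention rather than stated in this thread. Punctuation between conjuncts is ignored.

```python
def descriptor_coordination(descriptor, names, conjunction="and"):
    """Return (head, deprel, dependent) triples for 'DESCRIPTOR N1 ... and Nk'.

    The plural descriptor attaches once, to the first conjunct, and is
    understood to distribute over the whole coordination.
    """
    if not names:
        raise ValueError("at least one name is required")
    head = names[0]
    triples = [(head, "nmod:desc", descriptor)]
    for i, name in enumerate(names[1:], start=1):
        triples.append((head, "conj", name))
        if i == len(names) - 1:
            # Assumption: the conjunction attaches to the last conjunct.
            triples.append((name, "cc", conjunction))
    return triples


for triple in descriptor_coordination("Lakes", ["Michigan", "Erie"]):
    print(triple)
```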

A consequence of the above guidelines is that each of the dependency relations at issue should have a fairly rigid directionality, resulting in the word order generalization:

{compound, nmod:desc, nmod:poss, nummod} << head << {appos, nmod*, nmod:unmarked*}

(* nmod and nmod:unmarked would typically be post-head, but there are some pre-head usages. nmod examples include "at least/most" before a quantity, fronted dependents of a nominal e.g. the entity you are negotiating on behalf of, and some PP modifiers of nominal predicates or fragments. nmod:unmarked examples include "a couple" and extent modifiers of PPs.)

This seems appropriate as these pertain to English constructions where word order is paramount.
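
If the generalization holds, it could be checked mechanically against a treebank. The following is a minimal sketch, assuming tokens are identified by their 1-based position in the sentence; the function `check_direction` and its return labels are hypothetical, not part of any existing validator. The starred relations are reported as exceptions rather than errors when they occur before the head.

```python
# Relations expected before the head, per the generalization above.
PRE_HEAD = {"compound", "nmod:desc", "nmod:poss", "nummod"}
# Relations expected after the head.
POST_HEAD = {"appos", "nmod", "nmod:unmarked"}
# Starred relations: pre-head uses are known to exist.
STARRED = {"nmod", "nmod:unmarked"}


def check_direction(dep_index, head_index, deprel):
    """Classify one dependency by whether its direction fits the generalization."""
    dep_before_head = dep_index < head_index
    if deprel in PRE_HEAD:
        return "expected" if dep_before_head else "unexpected"
    if deprel in POST_HEAD:
        if not dep_before_head:
            return "expected"
        return "starred exception" if deprel in STARRED else "unexpected"
    return "not covered"


# President(1) Obama(2): the descriptor precedes its head.
print(check_direction(1, 2, "nmod:desc"))
# Firefox(1) 58.0(2): the version number follows its head.
print(check_direction(2, 1, "nmod:unmarked"))
```

Anything reported as "unexpected" would be a candidate for manual review rather than an automatic error, since the directionality is described only as fairly rigid.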

> Most Czech speakers never learn all the capitalization rules correctly. :-)

The same is true in Portuguese and, I guess, many other languages. Perhaps we can't even talk about rules in general, only about rules for specific editors or conventions for specific contexts. So it may be hard to qualify a given capitalization as correct or incorrect.

English mischievous nominals involving names and numbers #1040