Open nschneid opened 4 months ago
I have to make up my mind about the concrete questions you ask but a side note: The table from your paper mentions `nummod` as an option in several cases and I think that it is wrong (and it has been discussed and resolved already). `nummod` is for quantity, so it would be good in expressions like 10 chapters or one formula. But not in Chapter 10 or Formula One or Figure 4 or Firefox 58.0.
To me, Mirror Lake looks very similar to standard nominal compounds in English (mirror case, mirror hall?) and the only difference seems to be that it does not use a determiner.
Lake Michigan feels different. I suspect that you only use this order if the second word cannot function as a common noun (or common anything) in English, right? Michigan is just a meaningless label from the English perspective, although I think that it actually means "big lake" in one of the Algonquian languages. In this light, I would say it is closer to President Obama, which also is not a compound (unlike e.g. U.S. president).
I am not super excited about looking at determiner licensing exactly for the reasons you cite. You would need a determiner with the non-name compound a/the mirror lake but you don't need it with Mirror Lake. On the other hand, I would not completely exclude it if we do not have any better clues.
As a side remark, in Czech Lake Michigan would be jezero Michigan, where jezero "lake" is not part of the name, and jezero should be the head because it would inflect for the case required by the surrounding context (nominative as subject, accusative as object etc.) while Michigan would stay in nominative no matter what. This would be different from prezident Obama where both words would inflect. I'm not claiming that it should affect the English solution in any way. Perhaps only if there were no good criteria in English and a bunch of other languages had criteria like this, someone might say that we decide it in English the same way for the sake of parallelism. But it would be the last thing I would consider.
> in Czech Lake Michigan would be jezero Michigan
As a side note to this side remark, an even more common translation into Czech (according to several corpora) is Michiganské jezero. In this case, jezero is part of the name (named entity) because Michiganské is morphologically and syntactically an adjective. Both words inflect (e.g. genitive Michiganského jezera) and jezero is the head.
As a side comment regarding the side note:-), it is difficult to define which words are part of the name, even within a single language where capitalization usually helps. One could think that Michiganské as an adjective needs a governing noun, thus jezero must be part of the name as well. However, in náměstí Míru (square of Peace), náměstí is not considered part of the name and thus it is not capitalized, even though it is also the head (and Míru is a genitive noun). Yet the subway station of the same name has both words capitalized: stanice metra Náměstí Míru, according to the official prescriptive grammar. Most Czech speakers never learn all the capitalization rules correctly.:-)
> Lake Michigan feels different. I suspect that you only use this order if the second word cannot function as a common noun (or common anything) in English, right?
Looking at this list as a sample, that appears to be the main usage (the 2nd part is inherently proper), but "Lake Pleasant" and "Lake Red Rock" do occur. Of the 5 Great Lakes, "Lake Superior" has a transparent second part (the others are derived from native words—Erie, Huron, Michigan, Ontario).
In terms of place names mentioning geographical features, "Mount" is also an initial word, and "Mount Pleasant" is a popular example where the 2nd part is transparent.
Channeling @amir-zeldes (who is on vacation), the Lake X and Mount X patterns are presumably relics of older word orders in English or borrowings from French that are not productive beyond names.
Let me see if I can articulate some principles based on the sentiments above. A straw man proposal to define `nmod:desc`:

`nmod:desc` is for a class of constructions where an unmarked, bare nominal premodifies the head of a nominal. It is a subtype of `nmod` and (in principle) a special case of `nmod:unmarked`. We use the term "descriptor word" for the modifying noun and "descriptor phrase" for a phrase that it heads, possibly with dependents of its own.
A descriptor phrase denotes a main category to which a referent belongs. The descriptor is a modifier—typically of a name or number, often forming a larger proper name with its head. The head is essential to denoting the correct referent, whereas the descriptor is omissible (at least with strong context to narrow down a set of possible referents). This distinguishes it from nominal compounds, where the main category noun is the head.
The descriptor phrase is not a full noun phrase: it does not begin with a preposition or determiner (or possessive taking the place of a determiner). `appos` is therefore not appropriate to link the descriptor with its head.
Given a nominal-nominal combination X Y, the following tests can be applied:
- `nmod:poss(Y, X)` applies.
- `nummod(Y, X)` applies.
- `appos(X, Y)` may be appropriate. E.g. my brother Sam, the River Thames
- `compound(Y, X)` is likely appropriate. This includes cases where Y is a transparent category: Mississippi River, Mirror Lake
- `compound(Y, X)` is likely appropriate.
- `flat(Formula, One)`.
- `nmod:unmarked(Firefox, 58.0)`.

Combinations not eliminated by the above constraints are good candidates for `nmod:desc(Y, X)`. Typical morphosyntactic characteristics:
A consequence of the above guidelines is that each of the dependency relations at issue should have a fairly rigid directionality, resulting in the word order generalization:
{`compound`, `nmod:desc`, `nmod:poss`, `nummod`} << head << {`appos`, `nmod`*, `nmod:unmarked`*}

(* `nmod` and `nmod:unmarked` would typically be post-head, but there are some pre-head usages. `nmod` examples include "at least/most" before a quantity, fronted dependents of a nominal e.g. the entity you are negotiating on behalf of, and some PP modifiers of nominal predicates or fragments. `nmod:unmarked` examples include "a couple" and extent modifiers of PPs.)
This seems appropriate as these pertain to English constructions where word order is paramount.
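The word order generalization above could be sketched as a small sanity check over dependency edges. This is only an illustration: the function name `check_direction`, the 1-based token positions, and the three-way result are hypothetical choices and not part of any UD validator. The starred relations are treated as soft preferences rather than hard constraints.

```python
# Expected side of the head for each relation, per the generalization above.
PRE_HEAD = {"compound", "nmod:desc", "nmod:poss", "nummod"}
POST_HEAD = {"appos"}
# Starred in the generalization: typically post-head, with some pre-head uses.
USUALLY_POST_HEAD = {"nmod", "nmod:unmarked"}


def check_direction(dep_index, head_index, deprel):
    """Return 'ok', 'unusual', or 'violation' for one dependency edge.

    dep_index and head_index are 1-based token positions (an assumption
    made for this sketch).
    """
    dep_precedes_head = dep_index < head_index
    if deprel in PRE_HEAD:
        return "ok" if dep_precedes_head else "violation"
    if deprel in POST_HEAD:
        return "violation" if dep_precedes_head else "ok"
    if deprel in USUALLY_POST_HEAD:
        return "unusual" if dep_precedes_head else "ok"
    return "ok"  # relations outside the generalization are not checked


# "Mirror Lake": compound(Lake, Mirror), dependent at 1, head at 2
print(check_direction(1, 2, "compound"))       # ok
# "Firefox 58.0": nmod:unmarked(Firefox, 58.0), dependent at 2, head at 1
print(check_direction(2, 1, "nmod:unmarked"))  # ok
# A hypothetical compound attached after its head would be flagged
print(check_direction(2, 1, "compound"))       # violation
```

A check like this only looks at linear order, so it says nothing about which relation is the right one for a given pair; it would merely flag edges whose direction contradicts the proposal.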
> Most Czech speakers never learn all the capitalization rules correctly.:-)
Same in Portuguese and many other languages, I guess. We can’t even talk about rules in general, only about rules for specific editors or conventions for specific contexts. So it may be hard to qualify usage as correct or incorrect.
English has a tangled mess of minor patterns for constructing proper names. Having revised the `flat` guidelines to clarify the prototypical cases of headless vs. internally structured proper names, it is worth returning to "mischievous" cases that @amir-zeldes and I explored in this paper. Relatedly, @dan-zeman wrote a paper exploring how dates might be treated across several languages. For this thread, let's focus on constructions lacking evidence from agreement.

These constructions have been discussed in disparate threads, e.g. #455 and #654. I would like to see if looking at the range of constructions can lead us to some general principles for determining headedness and choosing a deprel.
This table from the paper offers a summary:
Also: dates written like February 23, February 23rd, February the 23rd
Some starting points:
- `appos` requires two full nominals, which would seemingly exclude most of these constructions (except my brother Sam)
- `compound`, the first part is plural if the second part is coordinated. This seems to be a distinct type of nominal modification construction, which we call `nmod:desc` ("descriptor").
- `compound` clearly works there.

Concrete questions I have been stuck on:

- `flat`, and if so, what about other noun+number names?