Classifiers - clf - Githubissues

UniversalDependencies / docs

Universal Dependencies online documentation

http://universaldependencies.org/

Apache License 2.0

267 stars 245 forks source link

Classifiers - clf #374

Closed ermanh closed 4 years ago

ermanh commented 7 years ago

Hi, I'm working on the Cantonese and Chinese-HK corpora with @kimgerdes and was happy to see that v2 has now included classifiers as a relation (http://universaldependencies.org/u/dep/clf.html).

However, the dependency parse in the Mandarin examples (nummod(noun, num) and then clf(num, classifier) for the sequence [NUM + CLF + NOUN]) seems to be somewhat at odds with the language data, and I also wonder if the page description should be more open-ended to take into account cross-linguistic differences in classifier syntax.

Specifically, the following description doesn't seem to apply well to all classifier languages, or necessarily to all classifier phenomena within a single language:

"Syntactically, the classifier groups with the numeral, rather than the noun, and we therefore treat classifiers as functional dependents of numerals (or possessives)"* .

Here are a few points I'd like to raise and get your opinions on:

In both Mandarin and Cantonese (and many other Chinese languages), classifiers may also be present when demonstratives are used [that/this + CLF + NOUN].

這本書 zhe ben shu this CLF book 'this book' (Mandarin, classifier is actually optional)

嗰隻狗 go zek gau that CLF dog 'that dog' (Cantonese, classifier obligatory)

One could argue then that the classifier should depend on the demonstrative in this situation, but see (2) below.

Bare classifier NPs [CLF + NOUN] are possible in a number of languages, including at least most Chinese languages (incl. both Mandarin and Cantonese), Vietnamese, Hmong, and Bangla, with a definite and/or indefinite reading. In these cases, there would be nothing inside the NP (square bracketed in the below examples) except for the noun that the classifier could depend on.

她買了 [本書] ta mai le [ben shu] she buy PERF [CLF book] 'She bought a book.' (Mandarin)

佢買咗 [本書] keoi maai zo [bun syu] 3sg buy PERF [CLF book] 'She bought a/the book.' (Cantonese)

[Nguoi chong] rat tot [CLF husband] very good ‘The husband was very good.’ (Vietnamese, Daley 1998)

[tus tsov] tshaib tshaib plab [CLF tiger] hungry hungry stomach ‘The tiger is/was very hungry.’ (Hmong, Jaisser 1987)

ami [boi-gulo] kini ni I [book-CLF] bought not ‘I didn’t buy the books.’ (Bangla, Dayal 2014)

In Mandarin and Cantonese, an NP can consist of just a classifier along with a numeral and/or demonstrative, with the classifier behaving like the head of the NP. This occurs often when the noun has already been mentioned or is already implicit in context.

我買兩本 ngo maai loeng bun (Cantonese) wo mai liang ben (Mandarin) I buy two CLF 'I'm buying two (books)'

In Cantonese, a reduplicated classifier also functions as a quantifying pronoun meaning "all" or "each one of them" (referring to the type of noun(s) the classifier would go with).

本本都平 bun-bun dou peng all also cheap 'All of them (books) are cheap' (Cantonese)

Classifiers typically derive from nouns historically (Aikhenvald 2000). While some classifiers have become generic and have scant semantics left, many still retain some semantics (albeit more abstract, such as "long thin thing"), while others, at least in Chinese languages, are still used as full nouns as well (usually container nouns like 'box', 'cup', etc.).

一個杯{子} ['cup' as noun] jat go bui (Cantonese) yi ge beizi {Mandarin} one CLF cup 'a cup'

一杯茶 ['cup' as classifier] jat bui caa (Cantonese) yi bei cha (Mandarin) one CLF(cup) tea 'a cup of tea'

(Incidentally, I'm curious how Danish/Norwegian/Swedish en kop(p) kaffe "a cup of coffee" is parsed.)

These four cases show that yes, classifiers have functional uses, but at the same time they are still somewhat noun-like if not pronoun-like at least in some Chinese languages. For some classifier languages the classifier may well be "completely functional" and inseparable from the numeral, but that is not the case for all classifier languages as shown in the data above.

The WALS chapter on numeral classifiers (especially the "Theoretical Issues" part) has a brief synopsis on the diversity of classifiers in classifier languages: http://wals.info/chapter/55

In summary, I wonder if it might be a good idea to be more neutral in the description page about how clf is to be implemented, i.e., on a language by language basis?

For the Chinese examples, I think there's a good argument to be made to treat the classifier as a direct child of the noun instead (clf(noun, classifier)), given the data we've seen especially with [CLF + NOUN] and [DET + NUM + CLF] noun phrases. That way the classifier is linked the same way no matter the syntactic environment (except when acting as a head noun in [DET + NUM + CLF]. One might even go so far as to argue the numeral should depend on the classifier in [NUM + CLF + NOUN], given that [NUM + NOUN] is not possible but [CLF + NOUN] is.

The alternative is to link the classifier differently in each of those cases (whether in terms of arrow direction or using other labels), with the analysis that the classifier fulfills a different function in different cases (e.g., one might argue the classifier in [CLF + NOUN] NPs is functioning like a determiner). This would probably end up chaotic, however, given there is no tag for classifiers to unify it in the same way that, say, an adposition in some languages would still be tagged ADP but given mark when it has an additional function as a clausal subordinator.

What are your thoughts?

I realize this is a really long post, but couldn't find an easy way to separate it given the Chinese data also serves to show the variance in how classifiers can be used. Apologies in advance. Thanks!

Herman Leung

jnivre commented 7 years ago

Thanks for bringing this up. Classifiers were added late in v2 and it is quite clear that we need better guidelines. There was a previous discussion that led to the conclusion that classifiers should go with numerals (or demonstratives) rather than nouns, but perhaps this needs to be reconsidered. The fact that the numeral/demonstrative can be dropped is not a problem in principle because the classifier can be promoted to take its place in the dependency structure.

You can find the previous discussion of classifiers here: https://github.com/UniversalDependencies/UD_v2/issues/21

ermanh commented 7 years ago

Hi @jnivre, thanks for the quick reply, and for linking the previous discussion. It appears there was no clear conclusion yet?

Does the idea of promotion predicate on the analysis that a numeral/demonstrative was actually there but got dropped (due to ellipsis, for example)?

If that's the case, this actually seems fine for the Mandarin indefinite [CLF + NOUN] situation, because it is perfectly fine to add the numeral "one" before it and still obtain the same interpretation, and some linguists have proposed that "one" has indeed been elided in such a case. However, that will not work for Cantonese definite [CLF + NOUN] NPs, because you cannot add any modifier (demonstrative, in this case) and still obtain the same meaning. Cantonese definite [CLF + NOUN] NPs have no proximal/distal semantics.

Moreover, while the Mandarin demonstratives can stand alone and function as pronouns (instead of just as bound determiners), the Cantonese counterparts cannot, so in Cantonese [DET + CLF + NOUN] NPs, it would seem strange for the classifier to depend on the determiner when [CLF + NOUN] is a valid construction, right?

There is another interesting thing that I just remembered -- in Chinese languages, many measurement units (for time, length, weight, etc.) behave syntactically like classifiers, even though they are typically translated as nouns in other languages:

一 天／年／寸／磅 
yi tian / nian / cun / bang 
one day / year / inch / pound  (Mandarin)

They typically do not need a following noun, and what makes them classifiers, apart from being able to form a constituent with numerals, is that they can be modified by the quantifier 每 'every/each', which regular nouns cannot:

每 天(/*狗)
mei tian(/*gou)
every day(/*dog)  (Mandarin)

This also passes the test of Cantonese reduplicated classifiers functioning as quantifying (pro)nouns:

日日／年年／寸寸／磅磅 [classifier]
jat-jat / lin-lin / cun-cun / bong-bong
day-day / year-year / inch-inch / pound-pound
'every day / year / inch / pound' (Cantonese)

*狗狗 [noun]
gau-gau
dog-dog
Intended: 'every dog' (Cantonese)

Here it seems we're in a bit of a pickle because if the idea is to preserve the content-word to content-word dependency principle, we would have to say the numeral should depend on the classifier for these measurement words. And probably also for the case of [NUM + CLF] noun phrases where the classifier is arguably the head of the NP. The same principle would however prefer the classifier to depend on the numeral in [NUM + CLF + NOUN], per @manning in the previous discussion. Then if we also say that CLF is promoted in place of the numeral in [CLF + NOUN], but promoted in place of the noun in [NUM + CLF], would that be a problematic conflict?

manning commented 7 years ago

Hi @ermanh, as you can see in the discussion @jnivre referenced, following some input from Bill Croft, most of the actual hashing out of what to do with clf was between @jnivre and me, and I can confidently say that neither of us consider ourselves experts in this area. So, if you wanted to expand and even change the guidelines for clf, we'd be very happy to have you do it! I think this would still be possible as the relation is new and not really in use. (Nevertheless, at the moment, you should regard what is now on the clf page as the resolution of that discussion!)

That said, in what I will call the "simple" case of a quantified noun, I think it would be a real shame to lose the relationship nummod(horses, three) between the number and the lexical noun, and I do think there is reasonably compelling evidence for treating [number CLF] as a constituent, as mentioned in https://github.com/UniversalDependencies/UD_v2/issues/21 , and that's in short what argues for the current analysis. You are certainly right that classifiers are used in a number of other constructions. It could be that they don't all have a uniform analysis; I'm not an expert or sure. While classifiers normally come from nouns, I do think it is reasonable to regard them as highly grammaticalized, which would justify treating them as a dependent that semantically types the number. For [NUM + CLF], is there any strong reason why it is not okay to treat NUM as the head?

p.s. For "three cups of water", we give this an analytic analysis of

nummod(cups, three) nmod(cups, water) case(water, of)

This is a questionable analysis since the words preceding water often seem to behave like a complex quantifier for water (and @sebschu considers such an analysis in his enhanced++ representation in http://nlp.stanford.edu/pubs/schuster2016enhanced.pdf ), but I think it is reasonable to say that the English construction is much less grammaticalized and the regular noun phrase analysis with cups as head seems preferable. In general, UD analysis always has complications like this because languages always have words part way along grammaticalization pathways.

amir-zeldes commented 7 years ago

Is the preferred analysis the same for "one cup quinoa"? (in a recipe context)

dan-zeman commented 7 years ago

@amir-zeldes : I would not use clf in languages where classifiers are not considered an established part of the grammar. I think in English you could just say that there is an elided preposition of.

ermanh commented 7 years ago

Hi @manning, thanks for the detailed reply. I won't claim to be an expert on classifiers as a cross-linguistic phenomenon myself either, and was very much hoping that other people working on classifier languages could chip in (e.g., Vietnamese, Japanese), but since I happen to be working on two of these languages, Cantonese and Mandarin, I've found the data to be more complicated.

I do see the merits and the argument behind clf(NUM, CLF) after reading UD_v2#21. However, while it works well for Mandarin some of the time, when we delved into the data, and then considered Cantonese as well, it becomes problematic -- I'll try to summarize more succinctly what I've laid out in the previous posts:

(1) Chinese classifiers can occur without the numeral. The following are all valid (and common) NP configurations in both Mandarin and Cantonese:

(a) NUM CLF
(b) NUM CLF NOUN
(c) DET NUM CLF NOUN
(d) DET CLF
(e) DET CLF NOUN
(f) CLF NOUN

If the classifier must depend on the numeral (when present), then we are presented with a few potential problems, both analytically and pragmatically:

(i) There would have to be at least 3 ways to annotate classifiers: clf({NUM,DET,NOUN}, CLF). Dependence of the classifier on DET would occur in (d-e), and would follow from the fact that the determiner is at least similar to the numeral in its function in modifying the noun. Dependence on NOUN is "forced" in (f) in the absence of any NUM or DET.
(ii) @jnivre suggested appealing to the idea of promotion for the case of (f) (and I would assume (d-e) as well). This is not entirely bad for Mandarin (except for 3 below), but doesn't work so well for Cantonese.
(iii) In noun-less NPs (1a, 1d), one can argue that the classifier is standing in for the absent noun, rather than the numeral or the determiner. Anaphoric/pronominal uses of classifiers is well-attested in some languages, including Cantonese (see 2iii).

(2) Cantonese

(i) Bare classifier phrases -- [CLF NOUN] -- is how definite noun phrases are constructed in Cantonese (and also in a number of other Chinese languages, Vietnamese, and Hmong, among the ones I'm ware).
(ii) [CLF NOUN] and [DET CLF NOUN] are possible, but *[DET NOUN]. This would make clf(DET, CLF) badly motivated for Cantonese.
(iii) A reduplicated classifier functions as a quantifying pronoun.

These cases suggest classifiers in Cantonese are somewhat DET/PRON-like, and just as close to the noun in these contexts as they may seem close to the numeral when numerals are involved.

(3) Some classifiers are synchronically also nouns and noun-like

(i) As you mentioned in UDv2/#21 some classifiers are "massifiers" and therefore more noun-like if not also nouns synchronically (e.g., 袋 'bag' in 一袋米 one bag rice 'a bag of rice').
(ii) For Chinese languages, there's another subclass -- units of measurement. Concepts like "inch", "pound", "day", "minute", etc. are syntactically classifiers rather than nouns. (In fact they are rarely followed by a noun; e.g., 一年 one year, 兩公里 two kilometers.)

Treating the numeral as head in (3i-ii), but especially (3ii), would give us the opposite of what is desired by UD -- the "more meaningful" content word is now the dependent rather than the head.

In sum, I don't think it is a clear-cut case that classifiers are always closer to numerals as a constituent than they are to nouns -- at least with regards to Chinese languages and when considering all of the possible NP configurations involving classifiers.

Given the above, I would like to suggest treating the classifier as an independent child of the noun instead (and therefore sister of the numeral and determiner). And when the noun is absent, the classifier should be treated as head of the noun phrase:

NP with noun:
clf(NOUN, CLF)
det(NOUN, DET)
nummod(NOUN, NUM)

NP without noun:
det(CLF, DET)
nummod(CLF, NUM)

What do you think?

This is the only reasonable solution I can think of right now that would still preserve numerals and determiners as direct dependents of the noun, but at the same time avoid the problems I presented above and the "chaos" of having to treat classifiers 3 different ways depending on what's present/absent. It would also be more desirable to have Cantonese and Mandarin, two closely related languages, treat classifiers the same way.

The option of treating measurement units (and marginally massifiers) as regular nouns seems undesirable because that would mean using semantic criteria to override their syntactic distribution. Subdividing classifiers also seems, perhaps, unnecessarily overcomplicating.

jnivre commented 7 years ago

I can see the advantages of a flat analysis, which I considered myself initially, but it would be interesting to hear some more perspectives.

@wcroft What is your take on this?

wcroft commented 7 years ago

Hello all,

Greetings -- this is my first contribution to any UD discussion on github, so please accept my apologies for not being familiar with what went before (and for still needing to learn about how to format comments in github, which is also new to me).

I'm a typologist who got interested in UD, in part because the goals of UD overlap with my goals of teaching typological syntax to undergraduates; see my TLT15 paper. Only recently have I been interacting with some of the UD people, as you can see from the comments by @manning and @jnivre.

The functions of classifiers in Chinese are typical of grammaticalization paths associated with (numeral) classifiers, or more precisely their constructional origin in what was probably originally an anaphoric construction [nummod(CLF, NUM)] which became a modifier phrase via apposition [appos(NOUN, CLF)]. But I would not suggest such an analysis at this point, in part because in general I think the analysis should reflect current functions of the elements, not their etymology.

For this reason, I would support distinct analyses for the different functions:

Modifying construction, following the content word to content word principle: nummod(NOUN, NUM) & clf(NUM,CLF)

"Anaphoric head" construction, which should be the same analysis as used for English the green one, for example -- I assume one is treated as the head? nummod(CLF, NUM). [edited by CDM: originally by mistake: nummod(NOUN,CLF)]

Determiner construction: det(NOUN,CLF)

And of course ordinary noun uses should be annotated like ordinary nouns would be.

Although there is reluctance to use different dependencies for what is at least orthographically the same form, it's actually pretty common for grammaticalized elements -- look at English that (det, mark, or the grammatical role of the demonstrative pronoun -- subj,obj,obl). And one characteristic of more analytic languages (Chinese, Vietnamese, Hmong etc.) is the lack of phonetic or at least orthographic change of elements with different functions historically derived from a single form. Instead, I would suggest that the orthographic/phonetic identity across functions be captured through the POS tag.

The one interesting case whose analysis doesn't seem to fit what I suggest is DET NUM CLF NOUN, where the classifier isn't repeated after both DET and NUM. (This happens in other languages, where the classifier is on its way to becoming an agreement form.) I think that the best solution here is still to have CLF as a dependent on NUM rather than a flat structure. The general justification for making CLF a dependent on the modifier rather than the head noun is that typologically, CLF groups with NUM (or whatever) rather than NOUN, e.g. there are N Num Clf languages but not CLf N Num languages.

Finally, the CLF-CLF quantifier construction is an example of a lexically productive MWE, which would probably be best annotated as compound(CLF,CLF) in UDv2. At least, that's my view on lexically productive MWEs -- but that's another issue!

ermanh commented 7 years ago

Hi @wcroft, it's great to have an typologist on board. I am new to UD and the github discussions myself, so far from being in the capacity to welcome you, I'm very glad you're able to lend your expertise.

From what I understand, and if I may respond and ask for clarification on each point, you recommend:

1. clf(NUM, CLF) whenever a numeral is present (regardless of what else is present/absent*)

*I assume you meant nummod(NUM, CLF) instead of nummod(NOUN, CLF) in the "anaphoric head" paragraph?
The analogy of "the green one" -- assuming you mean treating "one" as the head as we would treat the numeral as head in the anaphoric [NUM CLF] construction -- doesn't strike me as a good equivalent? One wouldn't say "the green two" but rather "the two green ones" in English, but in Cantonese and Mandarin only the numeral would change (i.e., one CLF, two CLF, five CLF). The classifier-adjective link would also seem strange to me; at least for Chinese languages, classifiers seem to have a more (pro)nominal characteristic to them.
I hate to beat this point to a pulp, but Chinese measurement units (that are syntactically classifiers) depending on the numeral or determiner would necessarily go against the content-word-as-head parallelism to other languages (more on 4 below).

2. Similarly, det(DET, clf) whenever a determiner is present but a numeral is absent

Same concern as in the last point in 1.

3. det(NOUN, CLF) in determiner constructions

Did you mean that this would only apply to bare classifier phrases [CLF NOUN] -- but not [DET CLF] or [DET CLF NOUN], the latter two of which would be treated as in 2?

4. "ordinary noun uses should be annotated like ordinary nouns would be"

Would you recommend this approach for the Chinese measurement units (and massifiers) I mentioned in previous posts?
If yes, I have to stress that Chinese measurement units behave syntactically like classifiers more than they do like nouns, and they are typically treated as a subtype of classifiers in Chinese grammars. This is a relatively small and closed list, of course, but still they cover common and frequently used concepts like "day" and "year". A fourth subclass of classifiers that are affected and that I hadn't mentioned are verbal classifiers where the classifier represents a frequency (like "times" in "I jumped five times").
If not, do you mean that if a word has both classifier and noun uses, then the nominal use is treated as a noun (both in terms of POS and syntax) accordingly?

Regarding the CLF-CLF quantifier construction, we've actually been treating it as a single lexical unit and tag it PRON (or DET if modifying a nominal).

Just to play devil's advocate a little, I was going through Aikhenvald's typological volume Classifiers (OUP, 2000) and she notes in her chapter on numeral classifiers that, contrary to Greenberg's observation, the order CLF NOUN NUM is in fact found in Ejagham, a Benue-Congo language (p.105):

a-mɘgɛ ` i-čɔkud a-ba'ɛ  
NC1/6-CL:SMALL.ROUND GN NCL19/3-orange.seed NCL1/6-two
'two orange seeds'
(tones removed except for floating genitive linker; example from p.99)

What's also special here is that it is the classifier, rather than the noun, that triggers noun class agreement on the numeral. Aikhenvald also found some Kegboid languages where the classifier forms a phonological as well as morphological constituent with the noun instead of the numeral (pp.110-111).

Obviously, these represent a very small minority of languages (which may never appear on UD, but nonetheless) seem to be valid counterexamples to Greenberg's universal that classifiers and numerals are naturally a tighter constituent. Would this affect our analysis?

There are two other things I wanted to bring up, @jnivre @manning and everyone else not already mentioned but following this issue:

(1) Is it necessary (though I assume ideal) that all languages that have classifiers treat them the same way in UD?

(2) Regarding the POS tag, it seems the current recommendation is PART. We've had difficulty with this, and actually considered NOUN for Chinese languages due in large part to the massifiers and measurement units. We're uncomfortable with PART also because classifiers in Chinese languages are arguably only a semi-closed class (Aikhenvald also remarks that for languages that have "repeaters" -- doubling up a noun and using the double as the classifier for the noun -- have an almost open set as a result). CLF would be seem ideal but I have the impression that the UDPOS list is fixed and not going to budge?

jnivre commented 7 years ago

I will only comment on the general question of whether it is necessary that all languages with classifiers adopt the same analysis, and leave the finer details to those with more knowledge.

One of the golden rules of UD is: "Don't annotate the same thing in different ways!" It is the essential idea of UD. So, if two languages have essentially the same construction or phenomenon, it should be treated the same way. Otherwise, what is the point of trying to do cross-linguistically consistent annotation? Just having a common label set is worth nothing (and can even be misleading) if you don't use the labels in the same way across languages.

Therefore, it is extremely important that everyone working within UD makes the effort to rise above language-specific traditions and distinctions and tries to see the forest for the trees. A cross-linguistically consistent annotation must be relatively coarse-grained and therefore cannot capture all the fine details in every language, and we must constantly ask ourselves whether we are undermining comparability by trying too hard to capture every subtle distinction in our own favorite language.

However, it is also important to note that two languages may contain constructions that are regularly denoted by the same term, but which are really different in essence, perhaps because grammaticalisation has gone much further in one language than in the other. Light verb constructions, for example, are clearly grammaticalised in languages like Hindi and Persian, but not in languages like English. And consequently, we don't advocate the same analysis for the constructions called LVC in English as in Persian. So, when we talk about languages having "classifiers", we need to carefully decide whether we are talking about essentially the same type of construction in all cases. [CDM: This is the counterbalancing principle: "Don’t make different things look the same"]

dan-zeman commented 7 years ago

@ermanh , could I ask what makes measure units and words like "day" and "year" syntactically closer to classifiers than to nouns in Chinese languages? Is it just the fact that they themselves do not require another classifier in expressions like "two years" or "fifty kilometers", or is there more to it?

As for the UPOSTAG, I assumed that Chinese classifiers would be tagged NOUN. Is there a recommendation for PART somewhere in the UD documentation? I see it neither at PART nor at clf but the documentation has become quite complex and it is easy to overlook stuff.

wcroft commented 7 years ago

Let me start with the general issue that @jnivre brings up. It is definitely important to annotate the same thing in the same way in different languages, otherwise crosslinguistic comparability is compromised. But I would say that what is the same (comparable) across languages is a particular function performed by a word. This is what I had in mind with clf in the passage that @manning cited in https://github.com/UniversalDependencies/UD_v2/issues/21: I only intended it to refer to the strategy for combining modifiers (numerals, possessives, demonstratives, etc.) with their head noun. It so happens in Chinese and in other languages that the words called "classifiers" are also used for other functions (anaphor, determiner). But that is a language-specific fact: in other languages the word used in the classifier (modifying) function does not also have those other uses.

So the functions are crosslinguistically comparable, but the word classes, in the sense of the full range of functions of a particular word in a language, are language-specific. When I used clf in my post, I meant the function; but when I used CLF for the Chinese examples, I meant the Chinese-specific word class.

This isn't always easy to determine. I for one don't think that Persian LVCs are so different from English ones that the two should be annotated differently. But that's for another thread.

On the specific points that @ermanh raises:

Yes, that was an error in my post; it should have been nummod(NUM,CLF).
I made another mistake regarding the anaphoric head type. I meant nummod(CLF,NUM). Sorry I made so many mistakes! I hope that makes clear what I had in mind, and the similarity between the constructions in the two languages.
That's great that you checked out Greenberg's generalization in Aikhenvald 2000! I didn't remember those. Still, Aikhenvald said they are very rare. She also mentions that the noun is a genitive in both Ejagham and the Kegboid languages. It suggests that the construction is etymologically nummod(CLF,NUM) & nmod(CLF,NOUN). But I think the construction has probably grammaticalized enough to be analyzed the usual way despite the word order.

Which leaves the measure constructions. Measure terms are similar but not identical to classifiers, and most typologists exclude them (otherwise all languages would be numeral classifier languages, and what makes numeral classifiers distinct is lost). See my 1994 Word paper on classifiers ("Semantic universals in classifier systems", Word 45:145-171) and references cited therein.

Having said that, there still remains the question of the best analysis of measure expressions. Etymologically they are often genitive constructions, and that is how UDv2 has annotated them. But that is somewhat problematic from both a syntactic and a semantic perspective. Syntactically, three cups quinoa isn't really genitive any more. Semantically, in She drank three cups of coffee, what is drunk is the liquid, not the container, and English is liberal enough to allow She drank three coffees. So I would be inclined towards a complex quantifier analysis. But to be honest, I don't know of a typological survey of measure expressions. Only after seeing what such a survey turns up would I recommend a particular analysis with some confidence.

ermanh commented 7 years ago

Hi @dan-zeman, Chinese classifiers and nouns differ syntactically in a number of ways, and measure words side with classifiers in all of the following. In general:

As you mentioned, and the most obvious, classifiers cannot be preceded by another classifier
Numerals can modify bare classifiers, but not bare nouns
Adjectives can modify/immediately precede bare nouns, but not classifiers
Bare nouns can be possessed, but not bare classifiers
Certain quantifying determiners can modify bare classifiers, but not bare nouns
Some other quantifying determiners modify bare nouns, but not classifiers (or [CLF NOUN])
(Monosyllabic) classifiers can reduplicate and function as quantifying pronouns or determiners, but nouns cannot

    classifiers     measure units       nouns

1.  * 一 個 條     * 一 個 年     一 個 蘋果
    one CLF CLF:long    one CLF CLF:year    one CLF apple
                            'an/one apple'

2.  三 條         三 年         * 三 蘋果
    three CLF:long      three CLF:year      three apple
    ‘three (of sth. long)'  'three years'       

3.  * 三 長 條     * 三 長 年     三 條 長 線
    three long CLF:long three long CLF:year three long line
                            'three long lines'

4.  * 我 的 條     * 我 的 年     我 的 蘋果
    1sg GEN CLF:long    1sg GEN CLF:year    Isg GEN apple
                            'my apple(s)'

5.  每 條         每 年         * 每 蘋果
    each CLF:long       each CLF:year       each apple
    'each one'      'each year'     

6.  * 任何 條      * 任何 年      任何 公司
    any CLF:long        any CLF:year        any company
                            'any company'

7.  條條          年年          * 線線
    CLF-CLF:long        CLF-CLF:year        line-line
    'every one'     'every year'

Having said all that -- and in light of @wcroft's last reply -- there are certain things that make the measure units mentioned above, along with "massifiers"/container/group classifiers, etc. (e.g., 'cup', 'box', 'group', etc.), differ from the more canonical classifiers. Most notably, the possessive particle 的 can occur between the non-canonical classifier and its following noun (e.g. 一磅 (的) 肉 one pound POSS meat 'a pound of meat'). Measurement units are more likely to show this variation than massifiers, etc., but in any case, in this regard they are also noun-like.

[I'll come back to the POS tag at the bottom.]

@wcroft, thanks for clarifying. Just to recap you recommend that (i) classifiers be linked with clf only when both the numeral/determiner/possessive and head noun are present, but (ii) det when it seems to function like a determiner (in bare classifier phrases), and (iii) as the head of the noun phrase when the head noun is absent (the anaphoric case).

I can get behind that -- we already agree on the anaphoric case, and our team had actually briefly considered det for the Cantonese definite [CLF NOUN] noun phrase.

What made us hesitate on det though was because classifiers do not currently have a unique POS tag, giving us an undesirable situation where hundreds of classifiers in Chinese languages are indistinguishable (tag-wise) from the other words that it shares its POS tag with, while each of the classifiers have at least 2 different possible functions, where the non-clf ones would obliterate their classifier identity. Currently it seems the only way to consistently label a classifier as a classifier (at least in Chinese languages) would be to use feature annotation.

To reiterate though, Chinese languages are not the only ones that have anaphoric and determiner functions of classifiers; the latter seems less common cross-linguistically (nonetheless in Vietnamese and Hmong besides Chinese languages), but the anaphoric function is found at least in Japanese, Burmese, Malay, and the languages I already mentioned that have the determiner function.

Back to your question, @dan-zeman, I must have gotten the impression that the recommended tag would be PART because the Mandarin examples on the clf page suggest that classifiers should be treated as function words, and the only function word category that classifiers would fit in is PART. It's weird to think of a NOUN dependent on a NUM (or a DET). But conversely, it's also weird to have a PART function as head of a noun phrase in the anaphoric case (though I wasn't aware @wcroft's recommendation wouldn't extend to the anaphoric and determiner constructions).

wcroft commented 7 years ago

A couple brief comments on POS tagging of Chinese classifiers. The UD POS tags make up a small set and so that will require lumping together of words that have somewhat different syntactic distributions. This is necessary to improve crosslinguistic comparability, and to make a practically useful POS tagset.

But in fact, even the traditional word classes given to languages lump together words that have somewhat different syntactic distributions, if one gets serious and looks at all the relevant distributions. In fact, really careful studies indicate that distribution patterns are so variable that any system of word classes ignores some distributional variation (Radical Construction Grammar, pp. 34-47). So UD POS tagging is only doing the same thing as language-specific grammatical traditions, but in a more coarse-grained way for practical reasons.

ermanh commented 7 years ago

@jnivre, @wcroft, thanks for the clarifications and re-emphasis on the tenet of crosslinguistic comparability. I hope I wasn't coming off as antagonistic in some way in my attempt to illustrate how and why the Chinese data seemed like such a headache (at least to me) in my effort to figure out how to deal with classifiers in UD.

In any case, I'm happy with where the conversation has led, and it looks like we've reached some form of agreement and interim conclusion so far per @wcroft's arguments. Let me summarize them as I understand it:

clf is reserved for classifiers when they accompany certain words modifying a noun, those modifier words most likely being numerals, and in some languages may also be determiners and possessive (pro)nouns. If different types of these modifier words are all present within the noun phrase, the classifier should attach to the numeral.
Classifiers in some languages may have additional functions that should be treated differently, including:

2.1. If a classifier is used anaphorically -- i.e., there is no noun within the same noun phrase -- then the classifier is promoted to head of that noun phrase, and any accompanying modifiers are labeled and attached to the classifier as if it were a noun.

2.2. If there is a genitive marker linking the classifier to the noun, treat the classifier as a noun (use nmod).

2.3. In some languages, classifiers may also appear in ((in)definite) constructions where a noun phrase consists of only a noun and a classifier). In this case, we treat the classifier as a determiner modifying the noun with the label det.

The recommended POS tag for classifiers is NOUN, regardless of their syntactic function in context. Languages that wish to separate classifiers from regular nouns may use feature annotation (perhaps NounType=Clf ?)

Do the above sound right/look reasonable to everybody?

wcroft commented 7 years ago

@ermanh, your conclusions in 1-2 follow my suggestions except for 2.2 -- I wrote that for Ejaghom and Kegboid an nmod analysis is etymologically correct, but synchronically it should be analyzed like other numeral classifiers, i.e. with nummod (NOUN,NUM) and clf(NUM,CLF). Not that we're going to see a treebank of a Kegboid language in the near future.

But I defer to @jnivre , @manning and the UD core team for a final judgement!

jnivre commented 7 years ago

This all sounds great to me. Let me also say that I think this has been a very constructive discussion and not the least bit antagonistic.

jnivre commented 7 years ago

I forgot to say that it would be excellent if someone could write this up and add it to the documentation for the "clf" relation.

dan-zeman commented 7 years ago

@ermanh, thanks for the detailed comparison with measure words! Yes, I also think that a feature (language-specific at the moment) is the way to go when a distinct subclass of a POS category is to be labeled. NounType=Clf sounds perfect to me.

@jnivre, maybe we should put the summary not only to the documentation of the clf relation, but maybe also somewhere in the general list of constructions. After all it is not just about the structure but also about the UPOS tags and possibly features.

manning commented 7 years ago

Thanks, @ermanh and @wcroft (and welcome!). Not antagonistic at all. I think the tension between good language-particular analyses vs. good language-universal ones is one we all struggle with, particularly when thinking about the details of languages that we know well.

Catching up, my summary is that there are two sort of plausible analyses, the "historical" one where you have the number word modify the classifier, and the "grammaticalized" one where you have the classifier as a dependent of the number word. I still find the third possible analysis, the flat one where both are dependents of the head noun, rather unappealing, in part because of Greenberg's almost universal.

So, I'm happy to endorse the emerging consensus! I'll start adding a few examples to the clf page, but I'm sure others could help with further documenting the decisions.

p.s. @wcroft: You can edit entries. Double click the pencil icon at the top of an entry. I corrected your original post for the anaphoric case, leaving a note about the original. I hope that makes it a bit easier for others to follow the thread.

qipeng commented 7 years ago

A bit late (and new) to the discussion, but here are some thoughts about the analyses and classification of measure words. A lot of this build on knowledge about Mandarin and limited knowledge about Cantonese and Japanese, and that I have personally struggled with this particular case when working on UD for Chinese (primarily Mandarin).

Analyses The main discussion seems to center around the construct [DET|NUM]* CLF (NOUN)* (abusing regex notations a bit). Regardless of order of these elements, it seems that CLF is really only necessary when both the [DET|NUM] part is present or ellipsed, and that the NOUN is present or ellipsed. Therefore, a flat analysis seems less appealing because it doesn't seem to reflect the connection between [DET|NUM] and CLF. Arguing from the UD gold standard ("same for same"), having CLFs as clf dependents of the [DET|NUM], which in turn modifies the NOUN head with proper relations, seems to be a more appealing proposal (also in @ermanh 's nice summary: https://github.com/UniversalDependencies/docs/issues/374#issuecomment-275605645).

In the case of anaphora, I tend to agree with @wcroft 's analysis, because essentially CLF is taking the role of the missing component, and hinting at the ellipsis thereof. For the definitive construct CLF NOUN in Cantonese (e.g. "架車 car-measure car"), it seem reasonable to assume that the CLF is really present in place of the missing [DET|NUM] CLF construct, i.e. "() 架車 (the) car-measure car" or "(一) 架車 (one) car-measure car", which can be unified with the similar case in Mandarin, but note this might need to depend on context.

Unfortunately, as parallel as this is to the English analysis, the discrepancy is still very notable, e.g. "一杯水 one cup water" and "a cup of water", but that might be an entire different discussion.

Measure word I personally think that despite the umbrella term "measure word", in Mandarin, some of them are actually playing the role of a noun, depending on context. Aside from the various properties @ermanh mentioned here (https://github.com/UniversalDependencies/docs/issues/374#issuecomment-275059106), I think there is actually a simple test based on anaphora. For instance, (a) 四磅牛肉 four pound beef (b) 牛肉重四磅 beef weighs four pounds (c) 我买了四瓶 I buy (past-tense-marker) four bottle (d) 我买了四个瓶子 I buy (past-tense-marker) four (measure) bottles In (a), I would argue that "磅(pound)" is clearly a CLF, as it is modifying the duration of the noun head "beef", together with "four". In (b), it is less clear that "four pounds" is an anaphoric construct as it's grammatically correct standing alone, where I would argue that "磅(pound)" is more of a NOUN than a CLF as the expression is complete in itself. In contrast, (c) seems to be a stronger case where anaphora is present (because "bottle" is not the entity being bought, but the contents thereof; for the former see (d)). Thus in (c), "瓶(bottle)" is more clearly a CLF and not a NOUN.

This is apparently orthogonal to how we should POS tag these words in the respective cases, but I thought this distinction is clear enough to make a difference thus sharing it here.

ermanh commented 7 years ago

@jnivre, @manning, @dan-zeman, glad to hear it, and thanks @manning for updating the clf page.

@wcroft, regarding 2.2, I was actually also (and mostly) thinking of the Chinese cases where measurement units can optionally be followed by a genitive marker before the noun. For example:

一 年 (的) 時間
one CL:year (GEN) time
'one year's time'

一 磅 (的) 肉
one CL:pound (GEN) meat
'a pound of meat'

There is practically no semantic difference with or without the genitive marker, so I was thinking that if the genitive marker is present, one would treat the classifier as a head noun as an nmod modifier of the "real" noun. Otherwise I'm not sure what we should do with the genitive marker?

Or in other words, if there is a genitive marker of some sort, should the classifier be treated as in English as a possessive phrase rather than as a "real" classifier? Is that where we might draw the line between a classifier analysis and an "English" analysis (besides whether the language in question has a "traditional" grammatical category for classifiers)?

Hi @qipeng, thanks for the summary and comments, glad to hear you agree with where this conversation has led. I'm curious about one comment you made though, and wonder if you could elaborate. You said:

For the definitive construct CLF NOUN in Cantonese (e.g. "架車 car-measure car"), it seem reasonable to assume that the CLF is really present in place of the missing [DET|NUM] CLF construct, i.e. "() 架車 (the) car-measure car" or "(一) 架車 (one) car-measure car", which can be unified with the similar case in Mandarin, but note this might need to depend on context.

It sounds like you're saying there are some cases in Mandarin where a bare classifier phrase (CLF NOUN) have different usages that should not be treated as a det relation between the CLF and NOUN? If that's the case, could you give some examples? I'm wondering if we would need/want to specify those cases (especially if they might apply to other languages besides Sinitic ones).

wcroft commented 7 years ago

The discussion about classifiers and measure expressions led me to take a quick look at chapter 1 of Karen Adams' monograph on classifiers in Austroasiatic, which can be downloaded here. Adams describes a variety of quantification types in Table 1.1, page 4, but later says, correctly I think, that they can be thought of essentially as cases of "counting" or "measuring" (see pp 9-10). So I think that it is useful to treat counting constructions as distinct from measuring constructions, even if in some languages the morphosyntax is similar. An important typological differences is that all languages have measure words, whereas only a subset of languages (albeit numerous and widespread) have classifiers ("counting words"). So I would restrict the clf dependency to classifiers as a modification strategy. (There are also possessive and other types of classifiers, and numeral classifiers are extended to other types of modifiers, so clf will have quite a bit of utility.)

As for measure constructions, there is no typological survey I am aware of. Two common strategies for measure word + noun are simple juxtaposition and a genitive construction. Of course many languages actually use simple juxtaposition for their genitive construction.

qipeng commented 7 years ago

@ermanh I was more trying to unify notations and give a universal analysis in that example.

Actually for Mandarin, I just realized CLF NOUN could be somewhat ambiguous/not distinguished. Using your example:

她买了本书 She buy PERF CLF book She bought a book

I (as a native speaker) would instinctively consider this [CLF book] part to be equivalent to [one book] (which suggests a nummod(book, CLF) analysis), although [a book] is also possible (det(book, CLF)). But given the absence of articles in Chinese (and many other languages), I would actually support the nummod analysis, as no strong sense of indefiniteness is necessarily expressed or implied. Note that this is different for the Cantonese example

架車好震 CLF car very shaky The car is very shaky

where det(car, CLF) seems the only sensible analysis. However, "佢買咗 [架車] She bought [CLF car]" seems more similar to the case above in Mandarin, where both nummod and det seem admissible.

There is also some interesting discussions about indefinite articles and unity numeral on StackExchange.

ermanh commented 7 years ago

@wcroft, it sounds like you're inclined to suggest that measure words be linked to the noun with nmod, regardless of whether the construction is genitive, and regardless of whether they are regarded as classifiers in individual languages, right? -- Or in any case, never labeled with clf.

Just to make sure we all mean the same thing by the same terms, by "measure words" do you mean the ones including what I have been calling "measurement units" in Chinese, but not "massifiers" and "container words" such as 'bag' in 'one bag (of) rice' or 'cup' in 'one cup (of) water' (judging by how the pdf you linked categorizes classifier types in Table 1.1)?

I assume measurement units/terms are likely to be a closed set and easy to define cross-linguistically. (It would definitely be easy for Mandarin and Cantonese.)

ermanh commented 7 years ago

@qipeng, I think I would be more inclined to use det(CLF, NOUN) even in the indefinite case in Mandarin. There does seem to be some parallel semantic correspondence between Mandarin [CLF NOUN] and English ['a' NOUN] on the one hand, and Mandarin ['one' CLF NOUN] and English ['one' NOUN] on the other hand, where including the numeral seems to add emphasis on the exact quantity (as one of the posters on your StackExchange link also points out). It's true there aren't articles in Chinese languages, but functionally the classifier does seem like a close equivalent in these cases.

Cross-linguistically within the Sinitic family, there also seems to be a lot of variation in the syntactic and semantic distribution of bare classifier phrases, from languages where they can have both indefinite and definite interpretations no matter whether they are preverbal or postverbal, to Mandarin where they can only occur in postverbal position and only have an indefinite interpretation (I can't find an online copy, but here's the reference; Google books may also allow a preview).

In the languages where it is possible for [CLF NOUN] to have either indefinite or definite interpretation, it would be impossible to decide between det and nummod unless the context is available. nummod would exclude the definite interpretation by default, but det could go either way, so for these languages det seems preferable in this situation.

Using det also for the indefinite bare classifier phrase in Mandarin, then, would keep the different Chinese languages together in their application of UD. What do you think?

qipeng commented 7 years ago

@ermanh Thanks for the comments. I don't disagree with your interpretations. I think the root of the problem here is the lack of distinction of indefinite reference and unity reference in these languages (compared to Indo-European languages where indefinite articles are present). I'm happy with defaulting to det for CLF NOUN phrases, but I do tend to believe that sometimes CLF NOUN is better interpreted as nummod, although the distinction is nuanced and often lost in translation (also unfortunately, convincing examples I can come up with seem to be mostly in conversational/spoken Mandarin).

wcroft commented 7 years ago

Regarding measure terms: @ermanh, Karen Adams argues that all the words in her Table 1.1 can be categorized as either measuring or counting. I would include words like cup and bundle as measure words, since they measure out some quantity of the noun referent.

The more interesting question is: what is the head of a measure construction?

It is unclear which is the head in a cup of coffee, cup or coffee. I drank a cup of coffee is perfectly fine, but I broke a cup of coffee is unacceptable. On the other hand, He smashed the bottle of wine is OK; definite phrases aren't usually interpreted as measures. For me, He drank two bottles of wine and He smashed two bottles of wine are both OK. But this just indicates that bottle can be a measure word as well as an ordinary noun (referring expression).

The issue here is that if the word is really just measuring the noun (the coffee), then one can argue that the noun is the head, following the content-word-to-content-word principle. The semantic constraint imposed by the predicate applies to the noun. But if the word is describing a physical object (the bottle), then that word is the head -- and it's not really a measure word. This doesn't apply to all such words: gallon or pound are only measure words, I think.

Whether you agree with treating measure terms as dependents and the noun as the head depends on your goals. If it's useful to relate the UD dependencies to the semantic selectional restrictions of predicates, then this is a reasonable way to go. One could stick to the nmod analysis for English of coffee, but then it will be less crosslinguistically comparable to languages like Chinese, which do not use their genitive construction for their measure construction. Chinese linguists could argue, with reason, that using nmod is imposing an English-like analysis on Chinese measure constructions. But we can all agree that both languages have measure constructions, though they are expressed using different strategies.

This would still leave the question of what dependency to assign to the measure term in a cup of coffee, if coffee is the head. I see that nummod in UD is restricted to numerals, and other quantifying expressions are analyzed as det. I suppose you could call a cup of a complex determiner. My own personal view is not to distinguish different types of modifiers apart from nmod since the differences are mainly semantic. There are a fair number of peculiar types of modifiers like measure terms that the det-nummod-amod categories aren't well suited to capture. Having just joined these discussions, I don't know if guidelines have already been laid down for measure expressions.

dan-zeman commented 7 years ago

@wcroft : I don't think (or remember?) we have guidelines specifically for measure expressions (there is a lot of constructions we have to cover yet!) but I think we do not want to make a cup of a function word/expression in English. We always struggle with how much semantics we want to allow to guide our decisions in UD. I believe we do not want to stretch the parallelism between languages so far that we hide interesting differences in strategies the languages use. So when there is a genitive construction, expressed either by morphology or by a preposition as in English, I would analyze it similarly to other genitive constructions: nmod(cup, coffee).

jnivre commented 7 years ago

I agree with Dan. This is similar to (not fully grammaticalised) light verbs, where we follow the "surface syntax", at least in cases where there is an overt preposition. Swedish is a bit different because we literally say "a cup coffee", and it is indeed a hotly disputed question in Swedish grammar which noun is the head. As a matter of fact, the current version of UD Swedish treats "kaffe" ("coffee") as the head and "en kopp" ("a cup") as "nmod". I don't have a problem with the Swedish analysis being different from the English one, because there is a real difference in realisation strategy, but I would also be prepared to revise the Swedish analysis if people find it preferable.

wcroft commented 7 years ago

OK. The annotation of measure constructions is an instance of a tough problem, namely that in the grammaticalization of certain common constructions, what we at UNM have come to call "head flipping" takes place:

Verb (head) + Complement > Auxiliary + Verb (head)
Relational Noun (head) + Noun > Adposition + Noun (head)
Quantity (head) + Noun > Quantifier + Noun (head)

Quantity expressions include numerals, measure terms, quantifiers, all of which in many (most? all?) languages start out as having the Noun in a genitive construction, like English a cup of coffee. You see the grammaticalization process farther along in English in expressions like a cuppa (British English), or a lotta books.

The problem is, at what point do we decide that the head has "flipped"? Grammaticalization theorists also debate this question, and there is no consensus on the answer, I'm afraid. Some argue that there has to be a "visible" morphosyntactic or phonological change before saying the analysis has changed; I think Traugott & Trousdale (2013) take this view. Others argue that the change happens before that; this is the Harris & Campbell (1995) position. I am on Harris & Campbell's side in this debate although I disagree with them on other issues. There are usually distributional facts that indicate the reanalysis has occurred, such as the drink/break examples I gave above.

The reality is that change is gradual and it takes time. It starts when the reanalysis--essentially a semantic change--occurs and it ends after morphosyntactic and/or phonological changes reflecting the reanalysis are all actualized. But when you have a practical goal like UD annotation where you have to draw a sharp line, then you have to decide where to draw the line in this continuum. I am inclined to draw it closer to the start; @dan-zeman and @jnivre draw it closer to the end. Drawing it near the start conforms more closely to the content-word-to-content-word principle, and yes, to the semantics including selectional restrictions, and to crosslinguistic comparability. Drawing it near the end conforms more closely to the source syntax (the single-language analysis), especially when few or no "visible" changes have yet occurred. But there is no perfect solution. Sorry!

PS: This doesn't mean that Chinese measure constructions should be annotated nmod(cup,coffee). It's probably better to annotate it like Swedish, namely nmod(coffee,cup) since the Chinese strategy is more like the Swedish strategy.

jnivre commented 7 years ago

@wcroft Thanks for a very nice summary of the issues. I completely agree with the overall characterisation of the problem, although perhaps not about where to draw the line. I think it would be really useful if we could work out some more systematic guidelines about what kind of evidence should count in order to promote cross-linguistic consistency.

wcroft commented 7 years ago

You're welcome @jnivre ! This will be (part of) what I plan to talk about at the UD workshop. The important thing is to have systematic guidelines. It's such a complex issue that we will probably need guidelines for individual constructions, although some general principles may emerge.

Just to illustrate the problem further. How are Russian numeral-noun constructions annotated in UD? If you follow the source syntax, then the constructions with numerals ending in 'one' (1, 21, 51, etc.) would be annotated amod(noun,numeral) because 'one' agrees like an adjective and the noun is in the nominative (or whatever case its governor requires); all other numerals except those ending in 'two' (2, 22, 52, etc.) would be annotated nmod(numeral,noun) because the numeral doesn't agree and the noun is in the genitive; and who knows what to do with numerals ending in 'two' because the numeral agrees like an adjective (albeit irregularly and defectively) but the noun is in the genitive -- or whatever case its governor requires. (And that is not all the complexity of Russian noun phrases with numerals!) If the Russian numeral-noun constructions are annotated simply nummod(noun,numeral), then the annotation is following the semantics -- but it's a heck of a lot easier to annotate. (Or as we sometimes write, a heckuva lot easier to annotate.)

ermanh commented 7 years ago

I've come upon another tricky situation and would like to ask for everyone's input again. According to the ADJ page, ordinal numerals in all (?) languages should be treated as adjectives (and therefore also labeled amod when modifying a noun). The page specifically mentions Czech as an example where ordinal numerals are traditionally called numerals:

Note that there are words that may be traditionally called numerals in some languages (e.g. Czech) but they are treated as adjectives in our universal tagging scheme.

In Chinese, ordinal numerals are also traditionally classified as numerals, because syntactically they behave like numerals rather than adjectives. I.e., they must be accompanied by a classifier and cannot directly modify bare nouns. For example,

第二 本 書
dier ben shu
second CL:book book
"the second book"

We have amod(書, 第二) if I interpreted the ADJ guidelines correctly above, and that leaves us with a "dangling" classifier. Adjectives shouldn't take classifiers as clf dependents -- only numerals and determiners/pronouns as we've discussed previously -- and neither should the head noun (take a classifier as a dependent). It also doesn't seem right to treat the classifier as a det dependent of the noun as we have agreed to do for determiner [CLF NOUN] phrases, given that in such phrases in Chinese, an adjective would typically come in between the classifier and noun (unlike the above example withe the ordinal numeral 第二) rather than before the classifier.

wcroft commented 7 years ago

A long time ago I did a crosslinguistic survey of different types of modifier constructions, which I never published. In doing so, we started from a semantically-based classification of modifier types. In that classification, ordinals were grouped with words like next, last and other in a distinct class which we called "set-member": the modifier is used to pick out a referent from an already established set, usually an ordered set.

I don't remember the typology of ordinals or more generally set-member terms; I have dug out the data for the relevant chapter in my Morphosyntax textbook, but haven't gotten to it yet. But European languages have adjective-like encoding of the modifiers, and other languages including Chinese have more cardinal-like encoding. So we're in a similar boat as with the measure terms: English and western European languages use a strategy that would be annotated with one UD dependency, Chinese uses a strategy that would be annotated with another one.

In our almost-UD pedagogical annotation scheme, we don't distinguish modifiers (det, nummod, amod), so we would call cardinals, ordinals and adjectives by the same dependency (mod). Our reasoning was that this distinction was largely semantic and we get in trouble when strategies crossed these semantic boundaries, or different strategies were used across languages for other semantic modifier types, like set-member modifiers.

I should add that for our pedagogical annotation, we do distinguish nmod and acl from mod, so we still get into similar trouble where languages vary between, say, a mod-like vs. nmod-like strategy for some type of modifier --- like measure + noun discussed above. There's no escape from the problem. Speakers can and do employ very different strategies in different languages. We just need to agree on a consistent guideline. I personally wouldn't be in favor of annotating English ordinals one way and Chinese ordinals another way, though.

I'm not sure what I would recommend, if you all agree that ordinals should be annotated consistently (apart from lumping the modifier types together to mod, that is). If we use amod, then it's a rather Eurocentric guideline, which some may object to. There was some concern in the audience at the TLT15 UD panel discussion about "imposing" English or Eurocentric analyses on other languages --- something that typologists worry about as well. If we use nummod, then it's a Sinocentric guideline. Take your pick: Western hegemony or Chinese hegemony! :smile:

jnivre commented 7 years ago

Although it may not be consistent with the current documentation for ADJ, I personally don't see a problem with treating ordinals as ADJ in some languages and NUM in others (and similarly for the syntactic relations) if there is clear evidence that they behave more like adjectives in one case and more like numerals in another. To me this falls under the slogan: "Make languages as parallel as possible but not any more parallel".

wcroft commented 7 years ago

While I think that what @jnivre suggests is the best way to go in some domains of grammar, I think it's risky in this particular case.

I assume that the motivation for analyzing Chinese ordinals as nummod is the use of a classifier, and the motivation for analyzing ordinals in many European language as amod is the use of number-gender agreement (morphologically similar or identical to the agreement used on property concept modifiers).

The risk is that if we use these strategies as criteria for use of one dependency over another, then to be consistent we should apply it in other cases. This would lead to annotating Chinese demonstratives with nummod since they use classifiers, and Spanish demonstratives as amod since they agree with the noun like adjectives, and so on. One could counter that there is already det to handle demonstratives. But that means that in effect a semantic criterion is used for demonstratives (deictic modification) vs. a syntactic one for ordinals.

Although I need to check the typological data, I think I can pretty safely say where the problems lie:

Modifiers are semantically very diverse (definiteness, deixis, set-member, numerals, quantifiers, properties, etc.), unlike say arguments, which are arguably all of the same type (entities).
The different strategies found in modification (agreement, classifiers, linkers, "genitive" case) are found with many different semantic types of modifiers; that is, the strategies are quite promiscuous across different modifier types.

Creating a different dependency for each type of modifier creates too many dependencies, and makes them effectively semantic. UDv2 has a compromise, with 5 different types of modifiers (det, nummod, amod, nmod and acl). In practice, I imagine that most modifiers are annotated based on their semantic type (numeral, deictic, property concept, etc.). But relying on strategies to annotate semantic types of modifiers that don't fit easily in these categories, like ordinals, means that sometimes modifiers are annotated based on semantics, sometimes based on syntax.

My solution is to not distinguish the first three modifier types, and instead use mod. I did it mainly for practical pedagogical reasons. The fact that modification strategies are pretty promiscuous across modifier types provides another motivation to do so. I suppose that if I were to take the empirical typological reasons seriously, I probably shouldn't distinguish nmod either, because it uses the same range of strategies (distinguishing acl is more defensible). But nmod is basically a semantically-defined modifier type (entity modifier), singled out because it can take its own modifiers in the usual way. I can't think of any language whose noun modifiers take their own modifiers in a different way than regular noun phrases do. So there are good syntactic reasons to single out nmod, and a straightforward, albeit semantic, means to reliably identify noun modifiers across languages.

In contrast, I have come to the conclusion that when it comes to subject, object, and oblique arguments, we should preserve that three-way distinction in annotation, and we should just annotate what the language does (ignoring differences in alignment strategies, which we agree that we shouldn't annotate).

First, there is evidence of a typological division between core and oblique arguments, although the boundary is a bit fuzzy. While not all languages index all core arguments, indexation of obliques is virtually not found. While overt case marking of core arguments occurs (accusatives, ergatives and very rarely, nominatives), zero case marking of obliques is virtually not found, apart from a few well-defined special cases that are anyway less argument-like (e.g. measures). Word order is more ambivalent: while subject < object < oblique is near-universal among VO languages, in OV languages object and oblique order varies a lot (subject is almost always initial). That may be due to attraction of the object to the verb, combined with the avoidance of postverbal elements in verb-final languages.

Second, core-oblique alternations do often have an information-packaging difference, or at least reflect participant roles which are ambivalent as to their information status, such as low-transitivity patient-like participants, e.g. English look for (oblique) vs. Spanish buscar (object).

So when it comes to subject/object (core) and oblique, (a) there is arguably a non-semantic motivation for the distinctions, and (b) there are enough differences in core/oblique strategies cross-linguistically that we can fairly safely rely on the strategies to decide most of the borderline cases in different languages. However, I don't think either of these conditions hold for different modifier types.

I know my view on modifiers is different from UD. And sorry for the long post. But this discussion has been very helpful in making me think harder about the issues.

jnivre commented 7 years ago

Thanks, @wcroft. It seems we need to think carefully about modifiers for future versions.

ermanh commented 7 years ago

Considering that the current UD guidelines already take a strictly semantic approach to det and particularly DET, and with respect to both your responses thus far, it seems that at this moment it would be best to apply a semantic approach and tag/label Chinese ordinals with ADJ and amod (unless we're switching over to Chinese hegemony)? But what to do then with the classifier -- would the best solution be to say that ordinals are a special subclass of ADJ that can take a clf dependent?

wcroft commented 7 years ago

Classifiers appear with a variety of modifiers: classifiers that are presumably numeral classifiers in origin get extended not only to demonstratives and ordinals, but also to property modifiers (adjectives) in some languages. Also, there are possessive classifiers in some languages (of independent origin). So I think clf should be allowed as a dependent of a variety of modifier types.

XinyingChen commented 7 years ago

@ermanh maybe it is a bit too late for this discussion. I would just add some points from a different perspective.

Why not a flat annotation? Fist, as @wcroft described, num and clf is more like a 'phrase', if we do a flat annotation, we will loss this valuable syntactic information. However, this information can be perserved, in a way, if we do the 'num head' annotation now. Second, as @manning said, with the annotation scheme now, we can easily promote the clf node, but the promotion will get more complicated if we do a flat annotation.
Why PART not NOUN? First, I agree that these two are both not very good options. The best solution would be we can have a 'clf' POS tag. Due to the relatively strict restriction of UD POS framework, we can only choose an option from available tags in the set. The reason to choose PART is because of the engineering motivation: NOUN is so commonly used comparing to PART. It will be easier to identify and convert PART into something new, if we have new choices for POS in the future. Of course, for current shceme, the clf is kind of identified with labels (such as 'nmod:clf'), but not put a double insurance? The true question here, in my eyes, is whether should we introduce a new POS tag rather than choose between PART and NOUN. Because, as mentioned before, they are both not good options.

Last but not least, it is true that there are many debatable decisions in UD, however, it is very important to look at the questions with understanding the goal and the language representation framework of UD. UD is for sure not a traditional framework. It is something between surface syntactic layer and deep syntactic layer (terms in MTT). And for making different languages more comparable, it uses a coarse-grained framework which has more difficulties to capture detailed variations than a fine-grained one. That is the compromise it (or any framework) has to make and it is also a dilemma may have no perfect solutions. But of course, it can improe with more discussions as this.

My personal concern about UD is more related to the question that brought up by @kimgerdes and Sylvain Kahane in their paper 'Dependency Annotation Choices: Assessing Theoretical and Practical Issues of Universal Dependencies' (here). That is, should UD go to a way that between surface syntax and deep syntax or it should choose a more surface syntax way? I can see why UD chose the current framework, but it does cause some problems with coherence and ambiguity. Is it really worse or more work to distinguish these two layers? For my understanding, it should not be so difficult to get a deep syntax treebank once we have the surface syntax treebank. Yes, it will be one more step, but first, it may can produce a better result and second, it will also have a more coherent framework with existing ones. Maybe, it is important to reconsider the question 'why this way'?

dan-zeman commented 4 years ago

I am going to close this long issue. If I am not mistaken, classifiers are now described in the documentation, hopefully in line with this thread. The annotation is probably still not fixed in UD_Chinese-GSD but that would deserve a separate issue, I think.