nummod should be NUM but it is NOUN: validation error in larger number phrases

UniversalDependencies / docs

Universal Dependencies online documentation

http://universaldependencies.org/


nummod should be NUM but it is NOUN: validation error in larger number phrases #596

Closed olesar closed 5 years ago

olesar commented 5 years ago

In Russian UD treebanks, the lemmas тысяча 'thousand', миллион 'million', миллиард 'billion' (and names for even larger numbers) are treated as NOUN. The reason is that they have the category of Number, which is not present in NUMs:
(1) пять тысяч.Plur евро 'five thousand.Plur euro'
(2) на тысячу.Sing человек 'on one thousand.Sing people'
(3) две тысячи.Plur грантов 'two thousand.Sing grants'

and, unlike numerals, they can be used with and without один 'one':
(4) тысяча долларов = одна тысяча евро '(one) thousand euro'

They have a paradigm structure and endings similar to the regular classes of substantives. Using these words in the plural also has semantic effects similar to those known from substantives: миллионы.Plur евро 'many, many millions of euros' (not applicable to 2 million euro).

Typologically, we can find different systems where nouns (the names of physical objects) are used as numerals as well (and probably classifier names used as numerals?).

An alternative decision could be: (i) to label 'thousand', 'million' etc. as NUM in numerical phrases and NOUN in other cases; (ii) to label them NOUNs in numerical phrases and assign nmod. We had some experience with (i) while annotating manually non-UD Russian corpora and I should say that there are a lot of borderline cases here in real texts. As for (ii), it would be a bit strange to have nmod inside complex numeric phrases such as one million two thousand three.

So, I would go for making the validation tool less restrictive in this case and allow NOUNs to be nummod.

In general, I would prefer to keep UD pos-tags less mirroring dependency relations (making parsing more challenging, yes) and rather reflecting morphosyntactic properties of morphology-rich languages (paradigm structure, declension type, type of behavior in specific patterns etc.).

dan-zeman commented 5 years ago

I am not necessarily against changing the guidelines but I believe that the validator follows the current guidelines in this respect. The definition of nummod refers to the NUM UPOS tag. It does not seem to cover quantifying expressions that are not NUM (specifically excluding pronominal quantifiers, which are tagged DET). So this seems to be (yet another) example where POS tags and relations are tightly dependent on each other.

Looking forward to hearing what others think. I know there are proposals to collapse several *mod relations into a simple mod, but that is something that we can hardly do without upgrading to UD v3 guidelines.

amir-zeldes commented 5 years ago

I also think it may be too restrictive to limit nummod to NUM - in many languages you can get nominalized, morphologically viable numeral nouns, which are not the same as pronominal quantifiers. For cases like pluralizable 'dozens' or 'hundreds' modifying another noun, I think it makes sense for them to be NOUN (since the normal numeral 'hundred' shouldn't be able to pluralize), but also nummod, at least in languages where you don't need an 'of' after such number nouns. This admittedly unusual mismatch between nummod and the POS is what would allow us to find such unusual constructions, so I'm for allowing this combination.

Incidentally, this use case is also one where reducing all modifiers to *mod would result in a loss of information.

lauma commented 5 years ago

To add the Latvian point of view: we also treat large numbers like quadrillion (kvadriljons), quintillion (kvintiljons) and some other quantities like zero (nulle), [one] half (puse), [one] third (trešdaļa) as nouns, so this issue has the potential to impact Latvian as well. The Latvian Treebank did it this way because "normal" inflecting numerals can have both masculine and feminine gender, e.g., četri krēsli (four chairs, masc.), četras lampas (four lamps, fem.), deviņi, deviņas (nine), but zero, half and the large something-illions can't. However, there is some struggle around the words hundred (simts) and thousand (tūkstotis), because their feminine forms are rare and considered obsolete in the contemporary language.

dan-zeman commented 5 years ago

@amir-zeldes : you do not need "of" but you need the genitive morphological case. And it is actually the counted noun, not the number, that is forced into the genitive.

For the record, the Czech morphological analyzer also tags tisíc "thousand", milión "million" and miliarda "billion" as nouns (tisíc only in certain contexts, I think). But it does not result in a conflict with nummod in the Czech treebanks because the original data uses only one adnominal relation type, and the only way to decide between nmod and nummod is to look at the part of speech. Thus in 72 milionů dolarů "72 million dollars", we have milionů as the head of the phrase, and the dependencies are nummod:gov(milionů, 72) and nmod(milionů, dolarů).
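To make the shape of that analysis concrete, here is a minimal Python sketch that encodes the three tokens and prints the relations in the rel(head, dependent) notation used above. The token ids and the "root" label on milionů are assumptions made only so that the phrase can stand in isolation; they are not taken from any treebank.

```python
# Illustrative encoding of the Czech analysis described above.
# Token ids and the "root" label are assumptions (the phrase is shown in isolation).
tokens = [
    # (id, form, head, deprel)
    (1, "72", 2, "nummod:gov"),
    (2, "milionů", 0, "root"),
    (3, "dolarů", 2, "nmod"),
]


def relations(tokens):
    """Render every non-root dependency as 'deprel(head_form, dependent_form)'."""
    forms = {tok_id: form for tok_id, form, _, _ in tokens}
    return [
        f"{deprel}({forms[head]}, {form})"
        for _, form, head, deprel in tokens
        if head != 0
    ]


print(relations(tokens))
# ['nummod:gov(milionů, 72)', 'nmod(milionů, dolarů)']
```

Note that the number noun, not the counted noun, is the head here, which is why the NOUN tag never meets nummod in this analysis.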

BTW, recently I found out that the Polish tradition is to regard jeden "one" as adjective rather than numeral (because it inflects for gender and number; this is unlike in Czech and I believe Russian, where it is tagged NUM despite having the same inflectional capability) [@adam-przepiorkowski @agnieszka-patejuk]. I told them that then they probably have to attach it as amod. But if the guideline is made more flexible and extended to nouns, then it would probably make sense to allow these adjectives as well.

gossebouma commented 5 years ago

In the Dutch treebanks, nummod is either a NUM or SYM. The latter is used for the first token in strings like '25% insuline', '1,5x1,5 centimeter', '+11,77 km' and '3x Roland Garros'.

The guidelines for SYM suggest that perhaps only the mathematical symbol in these tokens should be SYM and the rest NUM, but the underlying treebank treats these expressions as a single token (as they are written without space). So maybe SYM could be allowed for nummod as well?

msklvsk commented 5 years ago

What about N times? In Ukrainian, N is a foreign X and there is a nummod(times, N). Validator complains.

dan-zeman commented 5 years ago

What about N times? In Ukrainian, N is a foreign X and there is a nummod(times, N). Validator complains.

Hmm, tough call. I am not sure that N is a numeral. Semantically it is closer to indefinite quantifiers that often end up as DET in UD, but it is not a normal pronominal quantifier either. In any case I think that you have to decide whether N is foreign and unanalyzable or not. If you attach it via nummod then you are suggesting that you know it is a numeral, thus it is not unknown and it should not be tagged X.

jnivre commented 5 years ago

I agree. The general recommendation is to provide a real analysis whenever possible, and to opt out using X and flat:foreign only when annotators cannot interpret the foreign elements. I would also expect "nummod" to imply a different postag than X.

EmanuelUHH commented 5 years ago

We have a similar problem in German. We also treat numerals like "Hundert", "Tausend" etc. as nummod with the POS tag NOUN and support @olesar's request to make the validation tool less restrictive.

For the record, the Czech morphological analyzer also tags tisíc "thousand", milión "million" and miliarda "billion" as nouns (tisíc only in certain contexts, I think). But it does not result in a conflict with nummod in the Czech treebanks because the original data uses only one adnominal relation type, and the only way to decide between nmod and nummod is to look at the part of speech. Thus in 72 milionů dolarů "72 million dollars", we have milionů as the head of the phrase, and the dependencies are nummod:gov(milionů, 72) and nmod(milionů, dolarů).

@dan-zeman: I kind of get how dollars as a unit can be an nmod of million, but I'm not entirely convinced of this analysis. Isn't the numeral more a part of the number than of the unit?
We can write that phrase in several different ways:

Zweiundsiebzig Millionen Dollar
72 Millionen Dollar
72,000,000 $

Despite slight variations in representation, all of these phrases refer to exactly the same thing, namely a certain amount of money. When thinking about phrases like this (x amount of y) in a sentence context, the information about them most relevant to their place in the sentence seems to be the thing of which there is something - in this case it would be money. The fact that we are dealing with money is only represented by the unit. The number (or the combination of number and numeral) merely specifies the unit's amount. Thus I'd argue that it makes most sense to use the unit as the head.

sylvainkahane commented 5 years ago

Again and again we have the same difficulties when deciding between analyses because UD always hesitates between semantic and syntactic criteria. Both criteria are interesting and important and both analyses are suitable. But we cannot decide because we don't have clear criteria to decide. The choice to favor relations between content words is not based on syntactic/distributional criteria, and as long as this choice is maintained in UD we will have hesitations/confusions between syntactic and semantic analyses. (Of course that's my personal opinion and it is why we proposed the Surface-Syntactic UD annotation at UDW 2018.)

The example of expressions such as 72,000,000 $ is a very good illustration of the problem. From the semantic point of view, it is clear that the money unit must be taken as the head. If I pay 72,000,000 $ for something, it means that I give someone dollars (and not 72,000,000, which doesn't mean anything). So we all agree that at a certain level of analysis we want a relation between 'pay' and 'dollar', but this is a semantic relation (Mel'cuk 1988, 2010). If we consider the syntactic point of view, in many languages 72,000,000 $ doesn't work like 72 $. For instance, in French, we have:

soixante-douze dollars
soixante-douze millions de dollars

The first expression behaves like DET/ADJ NOUN constructions (quelques dollars 'some dollars'), while the second one contains a noun complement (une grande quantité de dollars 'a big amount of dollars'). For this latter construction, 'dollar' is still the semantic head, but it becomes very difficult to propose a syntactic analysis where quantité 'amount' is not the head.

dan-zeman commented 5 years ago

I would like to unravel and close this issue. I am afraid that the main problem is that the nummod relation is insufficiently motivated and defined, but we cannot simply get rid of it as long as we work under v2 guidelines. If the motivation were semantic, then it would probably include pronominal quantifiers as well as noun phrases like a bunch of — but these are clearly not included. And, UD is a syntactic, not a semantic framework. If the motivation is syntactic, then one could ask whether we really need to distinguish nummod from det and amod, especially when numerals can still be identified by the NUM tag.

Regardless of the motivation, I interpreted the nummod relation as being reserved for dependents that are tagged NUM. The nummod guidelines indeed refer to the documentation of NUM; however, the actual wording around the link is “any number phrase that serves to modify the meaning of the noun with a quantity.” A number phrase is perhaps a wider term than a numeral, if language-internal criteria lead to tagging the number as a noun.

Unless I overlooked something, several people in this thread argued for making the validator more benevolent while nobody voiced a strictly opposite stance. So I propose to allow NUM and NOUN (and maybe SYM?) as nummod dependents, but nothing else. Language-specific documentation has to define how to reliably figure out that a number is to be tagged NOUN (if it is allowed at all in the language).
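The proposed rule can be sketched as a small check over CoNLL-U lines. This is only an illustration in Python and not the code of the actual UD validator: the function name, the `row` helper and the two toy sentences (including their attachments) are assumptions made up for the example, and SYM is included although the proposal only mentions it as a possibility.

```python
# Hypothetical sketch of the proposed rule; NOT the actual UD validator code.
# SYM is included here although the proposal only says "maybe".
ALLOWED_NUMMOD_UPOS = {"NUM", "NOUN", "SYM"}


def row(tok_id, form, lemma, upos, head, deprel):
    """Build one 10-column CoNLL-U line (unused columns are left as '_')."""
    return "\t".join(
        [str(tok_id), form, lemma, upos, "_", "_", str(head), deprel, "_", "_"]
    )


def check_nummod(conllu_sentence):
    """Return (id, form, upos) for every nummod dependent with a disallowed UPOS."""
    problems = []
    for line in conllu_sentence.strip().splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        tok_id, form, upos, deprel = cols[0], cols[1], cols[3], cols[7]
        if "-" in tok_id or "." in tok_id:
            continue  # skip multiword token ranges and empty nodes
        # Compare only the universal part of the relation (nummod:gov -> nummod).
        if deprel.split(":")[0] == "nummod" and upos not in ALLOWED_NUMMOD_UPOS:
            problems.append((tok_id, form, upos))
    return problems


# Toy sentences; the attachments are illustrative assumptions only.
russian = "\n".join([
    row(1, "пять", "пять", "NUM", 2, "nummod"),
    row(2, "тысяч", "тысяча", "NOUN", 3, "nummod"),
    row(3, "евро", "евро", "NOUN", 0, "root"),
])
ukrainian = "\n".join([
    row(1, "N", "N", "X", 2, "nummod"),
    row(2, "разів", "раз", "NOUN", 0, "root"),
])

print(check_nummod(russian))    # []
print(check_nummod(ukrainian))  # [('1', 'N', 'X')]
```

Under this rule the NOUN-tagged тысяч passes, while the X-tagged N from the Ukrainian example is still reported, in line with the earlier point that nummod implies a tag other than X.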

jnivre commented 5 years ago

Thanks, Dan. I personally think that "nummod" is similar to the old "neg" relation from v1 in that it is not a proper syntactic relation. For future versions of UD, we may consider going the same way, that is, to remove the nummod relation (and use det, amod, advmod, etc. as appropriate) and instead have a feature marking phrases as numerical or quantitative (if the postag NUM is not sufficient by itself). But as long as we are operating under v2, we have to stick to the current guidelines and what you propose sounds good to me.

dan-zeman commented 5 years ago

Validator modified. Closing this issue.

rueter commented 5 years ago

Sorry, for the late interest, but I wasn't sure that I understood the argumentation as to why the Russian numerals тысяча 'thousand', миллион 'million', миллиард 'billion' (and names for even larger numbers) are treated as a NOUN and NOT NUM.

The reason is that they have a category on Number which is not present in NUMs: (1) пять тысяч.Plur евро 'five thousand.Plur euro' (2) на тысячу.Sing человек 'on one thousand.Sing people' (3) две тысячи.Plur грантов 'two thousand.Sing grants'

What about the hundreds, which by orthographic tradition are written as single words but, in fact, satisfy part of the same criteria (1–3)
(1) пять|сот (Genitive plural) евро 'five-hundred euros'
(2) на сто.Sing человек '?on one-hundred people(I don't know what this is supposed to mean)'
(3) две|сти (special form, perhaps dual?) евро 'two-hundred euros'

and, unlike numerals, can be used with and without один 'one': (4) тысяча долларов = одна тысяча евро '(one) thousand euro'

And, yes, the word сто 'hundred' cannot be used with the numeral одно 'one', so (3) and (4) illustrate where the larger numerals diverge from the hundreds, which in turn diverge from the teens and the tens. (три|ста, четыре|ста, пять|сот, шесть|сот)

They have a paradigm structure and endings similar to the regular classes of substantives. Using these words in the plural also has semantic effects similar to those known from substantives: миллионы.Plur евро 'many, many millions of euros' (not applicable to 2 million euro).

To achieve the same effect we have to use numeral derivations: 'ten' десят >> десятки евро 'tens of euros', 'hundred' сто >> сотни евро 'hundreds of euros', but then 'thousands of euros' is expressed with тысячи евро, an analogue of the example given for 'millions' above. This sounds closer to etymological source finding. Is this what we are trying to do in UD?

Typologically, we can find different systems where nouns (the names of physical objects) are used as numerals as well (and probably classifier names used as numerals?).

An alternative decision could be: (i) to label 'thousand', 'million' etc. as NUM in numerical phrases and NOUN in other cases; (ii) to label them NOUNs in numerical phrases and assign nmod. We had some experience with (i) while annotating manually non-UD Russian corpora and I should say that there are a lot of borderline cases here in real texts. As for (ii), it would be a bit strange to have nmod inside complex numeric phrases such as one million two thousand three.

So, I would go for making the validation tool less restrictive in this case and allow NOUNs to be nummod.

In general, I would prefer to keep UD pos-tags less mirroring dependency relations (making parsing more challenging, yes) and rather reflecting morphosyntactic properties of morphology-rich languages (paradigm structure, declension type, type of behavior in specific patterns etc.).

I guess I didn't understand how/why a morphologically rich numeral declension system had to be avoided in favor of a morphologically rich noun declension system.

amir-zeldes commented 5 years ago

Let me add my voice to those who might want to keep nummod as a relation: the behavior of numbers in a variety of languages can be syntactically distinct from determiners and nouns, or it can be a mix of both, sometimes also varying based on the number (as in Slavic; Coptic and others also have some unique constructions).

It's true that there is something semantic about numbers, but in my opinion that shouldn't mean we can't maintain a dependency relation motivated by their syntax as well. There are also some further considerations, even in languages in which they could be collapsed with determiners or something else:

- If NUM is a POS tag, then we might expect NUM phrases to 'project' some kind of numeral category - I'm guessing we don't want to get rid of the POS tag as well?
- It may be useful to compare numeral dependencies across languages, which nummod facilitates.
- Their combinatorics may be unique even in languages where they are 'normal' in isolation ('the book', and 'two books' look like det, but we also have 'the two books', 'all three books' etc.). In some examples you even get PDT+DT+CD: (not making grammaticality judgments here, these are just attested)
  - "double the number of grains each time , until you have covered all the 64 squares on the board"
  - "If we want to submit something in color by supplying all the 800 copies you will need could this be done ? "
  - "the Australians are top seeds for all the three titles ( Men , Women and Mixed Doubles ) at stake"
- UD has a 'semanto-centric' view of headedness, which suggests that individual language guidelines may decide to categorize what looks like a subordinate lexical head as the syntactic head of the phrase, and use the special nummod relation to indicate the construction.

In general, I would like to quote a colleague I respect who told me after some drastic guideline changes were being discussed: "you guys doing UD should pick a bad annotation scheme and stick to it". I'm not saying that nummod is a perfectly clear syntactic category, but we've been working with it fine for a while, and I for one think life would be easier if we keep it.

dan-zeman commented 5 years ago

@rueter : I think that there is again a scale.

- миллион "million" is semantically a number but morphologically it is a noun, just like регион "region" or район "district". Syntactically it is also like many nouns and it takes genitive modifiers, although these are interpreted as partitives rather than possessives (but there are such nouns too, cf. миллион долларов "million of dollars" vs. куча долларов "loads of dollars").
- один "one" is the other extreme and (while also semantically a number) it is very close to adjectives, including gender inflection.
- the values in between may show some features to various extents and lack others; for example, пятнадцать "fifteen" is inherently plural (while "million" has a singular and a plural form), has no gender inflection and shows heavy case syncretism (only 3 forms for 6 morphological cases); syntactically, it still requires the counted noun in the genitive, just like million. In my view, these words are clearly distinct from both nouns and adjectives on morphological and syntactic grounds, so they definitely deserve a category of their own (NUM). Whether "million" should also be NUM or not, and where exactly the borderline is ("thousand"? "hundred"?) is a different and difficult question.

rueter commented 5 years ago

@dan-zeman : NUM seems to be inconsistent in the 4 Russian treebanks

Thanks for the idea of scale, it sounds good.

Let's take a look at the nummod guidelines and their reference to NUM.

(1) There are instances where words are treated as NOUN, but their counterparts in Western Arabic numerals are treated as NUM.
words: тысяча 'thousand' : (Taiga=) NUM; (others=) NOUN
digits: 5000 : NUM

(2) A determiner is treated consistently as NUM. The equivalent of this determiner is dealt with differently in the treebanks of other UD languages, so it's no surprise that this presentation is given here.
Universal quantifier: оба 'both' : NUM

(3) Quantifiers that are not cardinal numerals are labeled as NUM instead of DET or ADV.
много 'many', сколько 'how many', больше 'more' etc. : NUM

(4) Collective nouns are treated as NUM in the two treebanks I found them in.
collection of people: шестеро/четверо : (PUD/GSD=) NUM

I see no problem with allowing a NOUN a nummod relation or a NUM a relation other than nummod, but it is confusing when NUM identifies words that are not cardinals. And it was interesting that the pronounced digits are treated differently from their written equivalents.
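A comparison like the one in (1) to (4) can be sketched as a small script that collects the UPOS tags a lemma receives in each treebank. This is only a rough Python illustration: the function names are made up, and the two toy "treebanks" below are invented rows that mimic the pattern reported in (1), not actual treebank data.

```python
from collections import defaultdict


def upos_by_lemma(treebanks, lemmas):
    """Map each lemma to {treebank name: set of UPOS tags seen for that lemma}.

    `treebanks` maps a treebank name to CoNLL-U text.
    """
    table = {lemma: defaultdict(set) for lemma in lemmas}
    for name, text in treebanks.items():
        for line in text.splitlines():
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            cols = line.split("\t")
            if len(cols) != 10 or "-" in cols[0] or "." in cols[0]:
                continue  # skip malformed lines, multiword ranges, empty nodes
            lemma, upos = cols[2], cols[3]
            if lemma in table:
                table[lemma][name].add(upos)
    return {lemma: dict(per_tb) for lemma, per_tb in table.items()}


def inconsistent(table):
    """Return the lemmas that carry more than one UPOS tag overall."""
    return sorted(
        lemma
        for lemma, per_tb in table.items()
        if len(set().union(*per_tb.values())) > 1
    )


# Invented toy rows (NOT real treebank content), mimicking observation (1).
toy = {
    "toy_a": "1\tтысяча\tтысяча\tNUM\t_\t_\t2\tnummod\t_\t_\n"
             "2\tевро\tевро\tNOUN\t_\t_\t0\troot\t_\t_",
    "toy_b": "1\tтысяча\tтысяча\tNOUN\t_\t_\t2\tnummod\t_\t_\n"
             "2\tевро\tевро\tNOUN\t_\t_\t0\troot\t_\t_",
}

table = upos_by_lemma(toy, ["тысяча", "евро"])
print(inconsistent(table))  # ['тысяча']
```

Running the same functions over real CoNLL-U files would only require reading each file into a string first; which lemmas to inspect is left to the annotators.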

dan-zeman commented 5 years ago

(1) Agreed that тысяча "thousand" should be treated consistently across the Russian treebanks. I am not so sure that NOUN is a good solution but consistency definitely matters. As for the Arabic digits, my bet would be that the reason is technical: you tag it NUM and you do not consult a dictionary... but you could also say that if you write 5000, you lose some of the possible usages that are more noun-like, e.g., Thousands protested in London.

(2) оба "both": while it has the pronominal feature of universal quantification, it also contains the definite quantity = 2. I suspect that the traditional grammar may say it is a numeral (that definitely is the case in Czech, so why not in Russian).

(3) This is against the guidelines. (Although it probably also follows the traditional grammar.)

(4) In my view, collective numerals are okay as NUM. But I don't know their Russian usage, I'm just guessing based on Czech.