UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Improve cross-language consistency of dates #210

Closed manning closed 8 years ago

manning commented 8 years ago

While I think it would be very good to improve consistency of linguistically interesting constructions, in some ways an easy and more annoying issue is inconsistency in formulaic special constructions. One such construction is dates. @danqi observed that they're very inconsistently annotated.

Here's some data. (Sorry if I've messed something up in the description of what happens in some language; I tried to do my best.)

English: if present, we took the month name as the head. We treated it as a proper noun. We always treat pure digit strings as NUM. We were actually inconsistent in the dependency for years, but I think we wanted to choose a different dependency from the one used for the day number (which you may or may not think is necessary). I certainly don't want to argue strongly for this analysis, but what we did was:

1 November PROPN 13 nmod
2 5 NUM 1 nummod
3 , 1 punct
4 1999 1 nmod:tmod OR appos (!)

A possible reason for having a rule like 'prefer month as head if present' is that you get the same dependencies when the text instead reads "5 November 1999" (as we write in Australia).

French: is similar (overlapping personnel?!?), but the month is treated as a NOUN, perhaps for no better reason than capitalization, and both numbers are now nmod:

1 Le DET 2 det
2 27 NUM 3 nmod
3 mai NOUN 8 nmod
4 2011 NUM 3 nmod

German: Has month as proper again and the year as nmod:

1 August PROPN 11 nmod
2 2009 NUM 1 nmod

Indonesian: Uses nummod (like English) but for both the day and the year (unlike English):

1 tanggal NOUN
2 30 NUM 3 nummod
3 Desember PROPN 1 nmod OR name (!)
4 1967 NUM 3 nummod

(As shown, while the whole date is often a nmod, there are multiple cases where it gets a name dependency. That's unrelated, but looks wrong....)

Italian: Takes the day number as the head, goes back to NOUN for the month, and uses nummod for the year.

1 Il DET 2 det
2 28 NUM 5 nmod
3 novembre NOUN 2 nmod
4 1992 NUM 2 nummod
5 annuncia VERB

Finnish: If I have it straight, if a year is mentioned, like "year 1965", then that is done left-headed with a nummod dependent, but if a full DMY date is given, then it is done with compound, but is chained and right-headed (naughty, naughty!):

1 Vuonna NOUN 11 nmod
2 1965 NUM 1 nummod

1 13 NUM 2 compound
2 päivänä NOUN 3 compound
3 toukokuuta NOUN 4 compound
4 2002 NUM 11 nmod

Danish: Has the month as head again, uses dep (!) and nmod for the two numbers, and this time, unlike French, the DET is a dependent of the month not the day number. Also, the day number is an ADJ not a NUM:

1 den DET 3 det
2 28. ADJ 3 dep
3 juli NOUN 8 dep
4 1992 NUM 3 nmod
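To make the comparison across these examples concrete, here is a minimal sketch that finds the internal head of a date, that is, the one token whose head lies outside the date span. The `Token` tuple and the `internal_head` helper are hypothetical names introduced only for illustration. The rows are copied from the English and Italian examples above; for the English year, `nmod:tmod` is picked arbitrarily from the two relations mentioned.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    idx: int
    form: str
    upos: Optional[str]  # None where the example above gives no tag
    head: int
    deprel: str


def internal_head(span: List[Token]) -> Token:
    """Return the single token in the span whose head is outside the span."""
    ids = {t.idx for t in span}
    outside = [t for t in span if t.head not in ids]
    if len(outside) != 1:
        raise ValueError("span does not have exactly one internal head")
    return outside[0]


# English "November 5, 1999" (year relation chosen as nmod:tmod for illustration)
ENGLISH = [
    Token(1, "November", "PROPN", 13, "nmod"),
    Token(2, "5", "NUM", 1, "nummod"),
    Token(3, ",", None, 1, "punct"),
    Token(4, "1999", None, 1, "nmod:tmod"),
]

# Italian "28 novembre 1992" (determiner and verb left out of the date span)
ITALIAN = [
    Token(2, "28", "NUM", 5, "nmod"),
    Token(3, "novembre", "NOUN", 2, "nmod"),
    Token(4, "1992", "NUM", 2, "nummod"),
]

print(internal_head(ENGLISH).form)
print(internal_head(ITALIAN).form)
```

Run on these two spans, the helper picks the month for English and the day number for Italian, which is exactly the inconsistency under discussion.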

Sorry that I didn't get through all the languages. Some had no full dates (e.g., Bulgarian, Greek), some were beyond my skills, and well ... as you might guess from the examples above, I got lazy and did no languages after "I" in the alphabet....

So, if you've read this far, do you think we can agree on a neutral acceptable way to annotate dates?

jnivre commented 8 years ago

Swedish follows what seems to be the majority view and takes the month as the head, which in our case is tagged as NOUN, perhaps because like the French we don't capitalize the names of months. A preceding date number is NUM/nummod, as is a following year. Thus:

1 1 NUM 2 nummod
2 januari NOUN 10 nmod
3 1972 NUM 2 nummod

If the date + month is followed by a descriptive NP like "next year", the latter is nmod instead:

1 1 NUM 2 nummod
2 januari NOUN 10 nmod
3 nästa DET 4 det
4 år NOUN 2 nmod

If the noun "år" (year) occurs before the number of the year, we treat the number as the head and the noun as an nmod of it. In this way we get parallelism with the first example above:

1 1 NUM 2 nummod
2 januari NOUN 10 nmod
3 år NOUN 4 nmod
4 1972 NUM 2 nummod

Based on Chris's survey and arguments, it seems we should at least agree on:

  1. Taking the month as the head of date expressions.
  2. Taking the day number as a nummod of the month.

Possibly we should also agree on:

  1. Using the tag PROPN for names of months, regardless of language-specific orthographic conventions.
  2. Using the relation nummod for the year dependent on the month.

While I can see the convenience of having different relations for the day and the year, I don't think nmod:tmod is more motivated for the year than for the day. Usually, it is the whole date expression that functions as nmod:tmod, and both the day and the year numbers are just nummods within this larger expression.
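The two points proposed for agreement above (month as head, day as nummod of the month) can be checked mechanically. The following is only a sketch: `check_date` is a hypothetical helper, not part of any UD tool, and the token dictionaries are hand-copied from the Swedish, French and Italian examples in this thread.

```python
def check_date(tokens, month_id, day_id):
    """Check a date against the two proposed points.

    tokens maps a token ID to (form, upos, head, deprel).
    Returns a list of human-readable problems (empty if none).
    """
    problems = []
    month_head = tokens[month_id][2]
    day_head, day_rel = tokens[day_id][2], tokens[day_id][3]
    # Point 1: the month is the head, so it must not depend on the day.
    if month_head == day_id:
        problems.append("month is not the head")
    # Point 2: the day number is a nummod of the month.
    if day_head != month_id or day_rel != "nummod":
        problems.append("day is not a nummod of the month")
    return problems


SWEDISH = {
    1: ("1", "NUM", 2, "nummod"),
    2: ("januari", "NOUN", 10, "nmod"),
    3: ("1972", "NUM", 2, "nummod"),
}
FRENCH = {
    2: ("27", "NUM", 3, "nmod"),
    3: ("mai", "NOUN", 8, "nmod"),
    4: ("2011", "NUM", 3, "nmod"),
}
ITALIAN = {
    2: ("28", "NUM", 5, "nmod"),
    3: ("novembre", "NOUN", 2, "nmod"),
    4: ("1992", "NUM", 2, "nummod"),
}

print(check_date(SWEDISH, month_id=2, day_id=1))
print(check_date(FRENCH, month_id=3, day_id=2))
print(check_date(ITALIAN, month_id=3, day_id=2))
```

The caller has to say which tokens are the month and the day; detecting them automatically would need language-specific month lists, which this sketch deliberately leaves out.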

dan-zeman commented 8 years ago

Same in Czech, the month is the head. Names of months are not proper nouns in Czech, thus they are tagged NOUN (unless written as numbers, then they are NUM) and I prefer to keep it that way.

na 8. května 1991 “for 8th May 1991”

1 na ADP 4 case
2 8 NUM 4 nummod
3 . PUNCT 2 punct
4 května NOUN 0 root
5 1991 NUM 4 nummod

If the phrase includes the day of the week, it heads the whole date:

středa 2. února “Wednesday 2nd February”

1 středa NOUN 0 root
2 2 NUM 4 nummod
3 . PUNCT 2 punct
4 února NOUN 1 nmod

If there is an interval of dates, it is analyzed as coordination:

26. – 29. 4. 1994 “April 26-29, 1994”

1 26 NUM 6 nummod
2 . PUNCT 1 punct
3 – PUNCT 1 punct
4 29 NUM 1 conj
5 . PUNCT 4 punct
6 4 NUM 0 root
7 . PUNCT 6 punct
8 1994 NUM 6 nummod

dan-zeman commented 8 years ago

What about Spanish: (el) 6 de noviembre de 1630

The preposition de (parallel to English “of”) makes this example different. It would feel strange not to have the 6 as head in this case. Elsewhere in UD, genitive and “of” modifiers are attached as nmod to the preceding nominal. The dates are also analyzed that way in Spanish UD 1.1 (thus if “month is the head” is our rule, Spanish violates it).

vinczev commented 8 years ago

In Hungarian, we also opted for choosing the day as the head so it also violates the rule of marking the month as the head. We have two reasons for that:

First, the morphological inner structure of dates is that of a possessive construction:

1998. szeptember 10. (or 10-e, both are in use) = 10th of September 1998

On the day, we have a regular possessive suffix (-e) like in "szék" - "széke" (chair - his chair).

Second, when the whole date gets inflected, the case suffix appears on the day:

1998. szeptember 10-én = on the 10th of September 1998

So the Hungarian analysis of such constructions is like this (we apply nmod:att for possessive constructions):

1 1998. 2 nmod:att
2 szeptember 3 nmod:att
3 10-e 0 root

jnivre commented 8 years ago

Okay. So there seems to be a consensus that the year depends on the month, but there are two proposals for the relation between day and month. There seem to be good arguments for taking the day as the head in Spanish and Hungarian. Are there equally good arguments for taking the month as the head in other languages, in which case this will be a case of "metataxis"? Or is it more arbitrary in other languages, in which case we maximize parallelism by taking the day as the head?

dan-zeman commented 8 years ago

I think it is language-dependent. In Czech it is more natural to modify the month by the day. You say "the third October"; even though semantically you mean "the third day of October" (as opposed to the third element in a sequence of Octobers), morpho/syntactically it is different. The ordinal number works like an adjective and agrees with the month also in case (třetího října is the most frequent usage, it is the whole phrase "third October" in the genitive case, which is the form used in reply to "when").

osenova commented 8 years ago

The same for Bulgarian – without the “case” agreeing, of course, since we lack cases in the nominal system: трети [‘third’ ordinal numeral behaving as adjective on syntactic level] септември [September]

wbwseeker commented 8 years ago

German does it like Czech and Bulgarian

der zweite Mai (literally: "the second May") "the second of May"

the ordinal inflects in agreement with the month as in Dan's example of Czech:

am zweiten Mai "on the second of May" (in dative case)


dan-zeman commented 8 years ago

So the space for standardization is limited but still larger than zero. I hope we can avoid the compound relation that is currently used in Finnish, @fginter? The year depending on the month as nummod (or nmod if not expressed numerically) also seems acceptable to me. Then each language should decide and document whether linguistic reasons favor the month or the day as the main head. If there are no good reasons against it, the month could be the default head. Is that reasonable?

mojgan-seraji commented 8 years ago

I also think that this is language-dependent. In Persian, there is the ezafe construction (-e) that links the elements in the phrase, indicating the semantic relation between the joined elements, e.g.: in third-e October-e 2015. The ezafe is not always marked in text (it is only marked after vowels; I marked it here just to show the construction), but syntactically it covers a possessive relation. Although the first element in the ezafe construction is normally the head, the ordinal number is always dependent on the month and functions like an adjective to the month/year.

case(October-3, in-1)
nummod(October-3, third-2)
root(ROOT, October-3)
nummod(October-3, 2015-4)

One more thing, this order is rarely found in Persian. The word "year" is usually placed before the year: in third(-e) October(-e) year(-e) 2015 (in the third of October of the year 2015)

case(October-3, in-1)
nummod(October-3, third-2)
root(ROOT, October-3)
nmod:poss(October-3, year-4)
nummod(year-4, 2015-5)

"third" modifies the month "October" and "2015" modifies the "year".

jnivre commented 8 years ago

I think this is a good proposal. The default is:

month -> day
month -> year

But if there is clear linguistic evidence that the month is a modifier of the day (in particular, case marking, either morphological or syntactic), then the alternative is:

day -> month
month -> year

The relations should be standard modifier relations taking the form into account: nummod if numerical, otherwise nmod (or possibly amod for day if it has adjectival inflection).
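As a rough illustration of this proposal, the sketch below builds the two arcs of a date from its three parts. The function name `date_arcs` and its flags are hypothetical, and treating a form as "numerical" when it consists of digits with an optional trailing period is an assumption made only for this example.

```python
def date_arcs(day, month, year, day_is_head=False, day_is_adjectival=False):
    """Return (head, dependent, relation) triples for a day/month/year date.

    day_is_head=False gives the default (month -> day, month -> year);
    day_is_head=True gives the alternative (day -> month, month -> year).
    day_is_adjectival only matters when the day is a dependent.
    """

    def relation(form, adjectival=False):
        if adjectival:
            return "amod"
        # Assumption: "numerical" means digits, optionally followed by "."
        return "nummod" if form.rstrip(".").isdigit() else "nmod"

    arcs = []
    if day_is_head:
        arcs.append((day, month, relation(month)))
    else:
        arcs.append((month, day, relation(day, day_is_adjectival)))
    # In both variants the year depends on the month.
    arcs.append((month, year, relation(year)))
    return arcs


print(date_arcs("5", "November", "1999"))
print(date_arcs("29", "settembre", "2015", day_is_head=True))
print(date_arcs("zweiten", "Mai", "2015", day_is_adjectival=True))
```

Whether a given language should set `day_is_head` is exactly the kind of decision that would have to be made and documented per language.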

msimi commented 8 years ago

Shall we go back to the day as head?

— Maria


SimonettaMontemagni commented 8 years ago

For Italian, we have linguistic evidence supporting the choice of having the day as the head of dates.

Dates in text can take one of the following forms, where the day always precedes the month:

1) il 29 settembre 2015 ‘the 29 September 2015’
2) il 29 settembre del 2015 ‘the 29 September of 2015’
3) il 29 di settembre 2015 ‘the 29 of September 2015’
4) il 29 di questo mese ‘the 29 of this month’

We believe that all these dates should be assigned the same underlying representation.

If we follow the strategy of taking the month as the head and the preceding day number and following year as numeral modifiers, the representation we obtain in the first two cases is OK, but for the third and fourth cases it is quite odd.

Compare the following dependency representations.

Case 1)

1 il il DET RD Definite=Def|Gender=Masc|Number=Sing|PronType=Art 2 det
2 29 29 NUM N NumType=Card 3 nummod
3 settembre settembre NOUN S Gender=Masc|Number=Sing 0 root
4 2015 2015 NUM N NumType=Card 3 nummod

This representation is OK; what is unclear is the head of the article, which could either be the day number (preferable but strange) or the month.

Case 2) in the context of the sentence “è possibile dal 29 di settembre 2015” ‘it is possible from the 29 of September 2015’

1 dal dal ADP EA Definite=Def|Gender=Masc|Number=Sing 2 case
2 29 29 NUM N NumType=Card 4 nummod
3 di di ADP E 4 case
4 settembre settembre NOUN S Gender=Masc|Number=Sing 0 root
5 2015 2015 NUM N NumType=Card 4 nummod

In this representation of 2), forcing the month to be the head gives rise to a quite anomalous syntactic structure. The role of case is unclear: it is expected to mark the relation linking its head (September in the case at hand) to the governing element (absent in this toy example). The relevant case is the one linked to the day.

This case should be treated differently, since in the underlying syntactic structure there is clear linguistic evidence that the day is the head, modified by the month in its turn modified by the year.

1 dal dal ADP EA Definite=Def|Gender=Masc|Number=Sing 2 case
2 29 29 NUM N NumType=Card 0 root
3 di di ADP E 4 case
4 settembre settembre NOUN S Gender=Masc|Number=Sing 2 nmod
5 2015 2015 NUM N NumType=Card 4 nummod

If we want to assign the same type of representation to all dates, the best solution for Italian is by taking the day as the head. In this way the representation we get is always acceptable.

What about languages similar to Italian?

jnivre commented 8 years ago

This sounds convincing to me, and I think the same arguments apply at least to Spanish. In fact, I think that for most (European) languages, these expressions started out with something like "the Nth day of (this) MONTH", where "day" is clearly the head. The word "day" was later dropped, thereby promoting the day number to the head. This seems to be the current status for Italian and Spanish, even though the preposition can be dropped. Other languages have gone further, so in Swedish for example it is now ungrammatical to insert a preposition: *den 3 av juni, which makes it more natural to take the month as the head, because this is a regular NP structure: det amod noun. So I think we have to accept that this is a case of metataxis, that is, a genuine structural difference across languages, which is exactly what we want to reveal by getting rid of the spurious differences.

dan-zeman commented 8 years ago

Agreed.

Out of curiosity, I just ran a few quick queries over a diachronic corpus of Czech. I found some examples from the 16th century that confirm the hypothesis formulated in Joakim's comment. Like

dne 27 . měsíce ledna “day 27 th month January”

There is no preposition like de in Spanish but everything is in genitive, which is our equivalent of the preposition. So here we would have

nummod(dne, 27)
punct(27, .)
nmod(dne, měsíce)
nmod(měsíce, ledna)
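The relation(head, dependent) notation used here can be read mechanically. The following is a minimal sketch with two hypothetical helpers, `parse_relations` and `find_root`; it identifies words by their form alone, which only works because every form in this example is unique.

```python
import re

# relation name, then "(head, dependent)"; the relation may contain ":" (e.g. nmod:poss)
REL = re.compile(r"([\w:]+)\s*\(\s*([^,()]+?)\s*,\s*([^()]+?)\s*\)")


def parse_relations(text):
    """Extract (relation, head, dependent) triples from relation(head, dep) notation."""
    return [(rel, head, dep) for rel, head, dep in REL.findall(text)]


def find_root(triples):
    """Return the one word that is a head but never a dependent."""
    heads = {head for _, head, _ in triples}
    dependents = {dep for _, _, dep in triples}
    roots = heads - dependents
    if len(roots) != 1:
        raise ValueError("expected exactly one root candidate")
    return roots.pop()


TEXT = "nummod(dne, 27) punct(27, .) nmod(dne, měsíce) nmod(měsíce, ledna)"
triples = parse_relations(TEXT)
print(triples)
print(find_root(triples))
```

On the 16th-century example this yields dne ("day") as the head of the whole expression, with the month name at the bottom of the nmod chain.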

I am going to close the issue now because it seems the situation has been sufficiently explored and consensus has been reached.

dan-zeman commented 8 years ago

@vcvpaiva on Portuguese (copied from https://github.com/UniversalDependencies/UD_Portuguese/issues/1): I re-read the UD documentation, but couldn't find anything about "dates", e.g. the first of May, 2015. Clearly the whole thing is a nominal phrase, but how do people treat the components? Because we have several "1º de maio/novembro/setembro" where "1º" is considered an adjective, this doesn't seem right to me.

dan-zeman commented 8 years ago

I think the Portuguese case is largely parallel to Spanish and Italian, except that the day is represented by an ordinal number (adjective) rather than a cardinal. It seems to me that the it/es solution is still the best option here, but we have to accept that the hidden head is dia “day” and that it has been elided.

vcvpaiva commented 8 years ago

Thanks for the reply. Mostly the Portuguese dates are as in Italian/Spanish, as you say. For Guy Fawkes night (5th November), Chris' example, we'd be just like the Italian and Spanish, no problem. It's just that "first/second" dates bring problems of their own, as then we use the ordinal instead of the cardinal. The "day" dia has definitely been elided.

vcvpaiva commented 8 years ago

hmm, don't people find it strange that it's a nominal phrase, but the head (following the suggestion of Italian and Spanish) is an adjective?

dan-zeman commented 8 years ago

I think that is what we do elsewhere when a noun is elided and its modifying adjective is promoted to the head position.

vcvpaiva commented 8 years ago

ok, thanks.

perrier54 commented 6 years ago

The conclusion of the discussion is that the syntactic relation between the day and the month is language-specific: in some languages the day is the head, and in other languages the month is the head. I disagree with this conclusion and I believe that the day as head of a date is a universal property. The philosophy of Universal Dependencies is that the syntactic annotation must be as close as possible to semantics. If we consider a date as a single semantic unit, this unit represents a day, and in the triplet (day, month, year) the semantic head is the day. Therefore, the syntactic head must also be the day, which does not depend on the language. The only contrary argument I found in the discussion is an argument of agreement. In some languages, the day appears as an adjective that agrees with the month. I don’t know Czech and Bulgarian, but for German one may consider that “der zweite Mai” is an ellipsis for “der zweite Tag von Mai”. This interpretation is consistent with the view that the day is the head. I have an additional remark: the use of NUMMOD as the label for the dependency month -> day does not follow the definition in the guide: “A numeric modifier of a noun is any number phrase that serves to modify the meaning of the noun with a quantity.” The day represents no quantity. AMOD would be more appropriate.

jnivre commented 6 years ago

I don't agree that UD annotation must be as close as possible to semantics. In cases of syntactic alternations at the clause level, for example, we consistently follow morphosyntax, not semantics. We treat "the window" as a subject in "the window broke" but as an object in "John broke the window" despite the fact that it has the same semantic role in both cases. The same holds across languages. If one language treats experiencers as subjects, while another treats them as obliques, this should be captured in the annotation.

Therefore, I think the previous consensus on date expressions, which allows languages to vary with respect to whether the day or the month is treated as the head if there is convincing evidence for one or the other analysis, was a reasonable compromise. However, the idea of analysing "(the) 20th May" as elliptic is definitely worth considering, especially for languages that have a variant with a preposition: "(the) 20th of May". This is similar to variation that is often found in partitive constructions like "half (of) the people" and where the most natural analysis is probably to treat the second element as "nmod" in both cases.

jnivre commented 6 years ago

I suggest that the treatment of dates can be handled by the working group on MWEs, where complex names should also be discussed.

sylvainkahane commented 6 years ago

It seems that during the discussion about dates the question of the POS of day and year numbers was not treated. In French these numbers behave like nouns. The day number cannot be used without a determiner or day name: Je viendrai le 13 / lundi 13 / le lundi 13 / *13 'I will come the 13 / Monday 13 / the Monday 13 / *13'. The year number behaves like a month name: Je viendrai en 2017 / en novembre / en novembre 2017 'I will come in 2017 / in November / in November 2017'. They can also be subjects (2017 is a good year). To be more precise, year numbers behave as proper nouns (no determiner) and day numbers as nouns in French.

We can label these numbers as NUMs (to avoid polycategoriality), but we must be aware that they occupy nominal positions. Consequently, it is very problematic to encode them as nummod as they are now in UD_English for instance.

Same remark with numbers used as names: le bus 47 / le 47 'the bus 47 / the 47'. See also #466, where we declined to encode similar numbers as nummod.

jnivre commented 6 years ago

POSTAGs were discussed at the beginning of the thread, where it was noticed that there was some inconsistency w.r.t. the use of NOUN or PROPN for month names, for example. But we probably need to revisit it.

CatalinaMaranduc commented 6 years ago

In Romanian, the month names are written in lower case and considered nouns. There are differences between languages.


amir-zeldes commented 6 years ago

I'm all for making the day head in English, and I agree with @perrier54 that the semantics is not negligible. Especially if multiple syntactic interpretations are possible, I would prefer for the day to be head, since the entire date is a kind of day.

For English, I think such an analysis is possible for various constructions:

In other words, unless the language makes it strongly impossible to do this (e.g. the month takes governed case by a predicate, and the day takes a fixed case modifying the month), I would plead for giving the day the benefit of the doubt and making it the head if possible.