UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

German "ein" ("one") used as a numeral #1061

Closed verenablaschke closed 3 weeks ago

verenablaschke commented 3 weeks ago

In German, the numeral "one" can have the same form as the indefinite article (including being inflected). The German UD guidelines say about this:

The word ein can be either translated as the indefinite article “a” or as the numeral “one”. It is always tagged DET and not NUM, i.e., we do not attempt to distinguish contexts in which the emphasis is on quantity and not on indefiniteness. (The quantity is present in any case, as the indefinite article is never used in plural.) [x]

This causes several inconsistencies and a validator complaint:

  1. The new leaf-det-clf validation rule (https://github.com/UniversalDependencies/docs/issues/1059) complains about structures where “ein” is modified. For instance, the German HDT treebank contains sentences like “Dieses Vergehen könne mit bis zu einem Jahr Haft oder einer Geldstrafe geahndet werden.” (“This offence can be punished with up to one year in prison or a fine.”) where “bis zu”/“up to” modifies “einem”/“one”, and “einem” modifies “Jahr”/“year”. Treating “einem” purely as a determiner leads to a determiner being the head of a dependent. (HDT also contains extremely similar structures that are clearly marked as numerals, e.g. “Ihm droht nun eine Gefängnisstrafe von bis zu fünf Jahren [...]” “He is now facing a prison sentence of up to five years” -- annotated with the same tree structure, but “fünf”/“five” is a NUM/nummod.)
     [Screenshot: dependency tree for “bis zu einem Jahr Haft”]
  2. We also find sentences where “ein” is directly contrasted with other numbers, e.g. “Beide Auftritte bleiben laut Koch noch ein bis zwei Wochen im Netz .” (“According to Koch, both performances will remain online for another one to two weeks.”), which are currently treated rather unintuitively. “ein” is the determiner of “Wochen”/“weeks”, “bis”/“to” is treated as a modifier of “Wochen” and “zwei”/“two” is analyzed as a numeral that exists independently of any “one to two” structure. It would be more intuitive to treat “ein” as numeral and “ein bis zwei” as a phrase.
     [Screenshot: dependency tree for “ein bis zwei Wochen”]
  3. It’s even possible to think of sentences where a DET vs NUM analysis makes a difference in meaning: “Es dauert nicht nur eine_NUM Minute (sondern zwei Minuten) / Es dauert nicht nur eine_DET Minute (sondern eine Stunde).” (“It doesn’t take only one minute (but two minutes). / It doesn’t take only a minute (but an hour).”)

  4. As a side note, both Dutch treebanks have plenty of entries where “een” is tagged as NUM, and all three Swedish treebanks have instances of “en” or “ett” as NUM.

Can we relax the strong requirement of “ein(e)” needing to be a determiner in German UD analyses?
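To make the validator complaint in point 1 concrete, here is a minimal sketch that lists tokens attached as `det` that themselves have dependents. It is not the validator's actual implementation; the function names and the toy rows are made up for illustration, and the relations given to "bis" and "zu" are invented rather than copied from HDT.

```python
def parse_conllu_sentence(block):
    """Parse one CoNLL-U sentence block into a list of token dicts."""
    tokens = []
    for line in block.strip().splitlines():
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if not cols[0].isdigit():
            continue
        tokens.append({
            "id": int(cols[0]), "form": cols[1], "lemma": cols[2],
            "upos": cols[3], "xpos": cols[4],
            "head": int(cols[6]), "deprel": cols[7],
        })
    return tokens


def det_with_dependents(tokens):
    """Return (det_form, [child_forms]) for every det token that has children."""
    hits = []
    for tok in tokens:
        # Ignore relation subtypes such as det:poss.
        if tok["deprel"].split(":")[0] != "det":
            continue
        children = [t["form"] for t in tokens if t["head"] == tok["id"]]
        if children:
            hits.append((tok["form"], children))
    return hits


# Toy fragment; the relations for "bis" and "zu" are invented for illustration.
ROWS = [
    (1, "bis", "bis", "ADP", "APPR", "_", 3, "advmod", "_", "_"),
    (2, "zu", "zu", "ADP", "APPR", "_", 1, "fixed", "_", "_"),
    (3, "einem", "ein", "DET", "ART", "_", 4, "det", "_", "_"),
    (4, "Jahr", "Jahr", "NOUN", "NN", "_", 0, "root", "_", "_"),
]
SAMPLE = "\n".join("\t".join(str(c) for c in row) for row in ROWS)

print(det_with_dependents(parse_conllu_sentence(SAMPLE)))
```

On this toy input the only hit is "einem" with the child "bis", which is the shape of structure the rule objects to.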

nschneid commented 3 weeks ago

It seems to me that there will be some cases where one tag or the other is more intuitive, but there may be a lot of gray area in between. Do other German treebanks make a distinction, and if so, what tests do they give?

(I don't know if an analogy to English one is helpful because it cannot be an indefinite article, but there are 3 different tags that can apply.)

LeonieWeissweiler commented 3 weeks ago

GSD has one occurrence of "ein" tagged as NUM (in an unambiguous context as described above) but also several validation errors because of numeral "ein" tagged as DET. The other two have no "ein" as NUM.

verenablaschke commented 3 weeks ago

The other German treebanks follow the language-specific guidelines as well, with the one exception Leonie pointed out: GSD sentence train-s4486 "Die Behaarung besteht aus ein - oder vielzelligen und nichtdrüsigen oder aber mit einem ein - oder mehrzelligen Drüsenkopf versehenen Trichomen." ("The coat of hair consists of uni- or multicellular and non-glandular trichomes or trichomes with a uni- or multicellular glandular head."). Curiously enough, the first "ein" is treated as a NUM and the second one is treated as a DET although the context looks basically identical (I don't think there is a difference between "mehrzellig" and "vielzellig" (both: "multicellular", literally "multiple/several-celled" and "many-celled"), but I can't say for sure). Either way, in both cases "ein" only appears on its own because of a truncation.

amir-zeldes commented 3 weeks ago

+1 for distinguishing NUM from DET in unambiguous environments, if it's possible to implement... I guess when it's modified like that it's a clear indication.

gossebouma commented 3 weeks ago

Note that the Dutch een/NUM examples are all cases where the lemma is "één" and are also pronounced as such (/eːn/). The determiner 'een' is pronounced /ən/. The een/één cases are instances of sloppy spelling or older corpus data where the diacritics were not preserved. Thus, the Dutch een/NUM cases can be easily identified on the basis of pronunciation.

Stormur commented 3 weeks ago

I think that the current guidelines for ein in German make sense and that introducing a distinction between NUM and DET would introduce an arbitrary variation, as there is really nothing in morphosyntax which can determine a difference. This seems to be a very common occurrence in languages.

We could content ourselves by observing that NUM is in fact a peculiar subclass of DET. Then I think that cases like bis zu einem Jahr can be rather uncontroversially treated with bis zu modifying the head, i.e. Jahr, in much the same way as one would treat auch ein Jahr, nur ein Jahr, etc.; that is, the scope is the whole phrase.

The case of ein bis zwei Wochen is more interesting because of the missing agreement of ein with Woche, but I can envision this can be treated as a case of ellipsis. Now, the unwieldy thing here is that this is a "right-pending ellipsis".

LeonieWeissweiler commented 3 weeks ago

I'm skeptical about "ein bis zwei" being an ellipsis. But even if we analyse it that way, IMO it only illustrates a parallel structure where both numbers are NUM.

I'm also skeptical about NUM as a subclass of DET -- and would entirely disagree with any interpretation that would also result in re-annotating "zwei, drei, vier, ..." as DET.

I would love to get more opinions on how to resolve this in accordance with the general guidelines (@nschneid @amir-zeldes @jnivre @dan-zeman ), ideally so we can ensure HDT passing all validator checks before the upcoming data freeze.

nschneid commented 3 weeks ago

For the first screenshot, I don't understand the nmod relation to an ADP. Normally adpositions attach as case.

As a general matter, I think of NUM as really a semantic category whose syntactic distribution is a hybrid of DET and NOUN. (You might also say that nummod is effectively a subtype of det, in some languages anyway.) As for how to apply it, it seems like a practical matter: if there is no history of treebanks separating NUM from DET for ein then it seems like a lot of work to implement the distinction across the board.

amir-zeldes commented 3 weeks ago

I think of NUM as really a semantic category whose syntactic distribution is a hybrid of DET and NOUN. (You might also say that nummod is effectively a subtype of det, in some languages anyway.)

Exactly, I think it's very language specific and we shouldn't base too much on how German or English work.

I would love to get more opinions on how to resolve this in accordance with the general guidelines (@nschneid @amir-zeldes @jnivre @dan-zeman ), ideally so we can ensure HDT passing all validator checks before the upcoming data freeze.

For German I think it's usually ambiguous for "ein", and it's fine to assume DET until there is reason to do otherwise. For "zwei" etc. I think the general guidelines would lead users to expect nummod.

if there is no history of treebanks separating NUM from DET for ein then it seems like a lot of work to implement the distinction across the board

I agree it would probably take some manual inspection, but maybe some basic queries could catch most cases:

LeonieWeissweiler commented 3 weeks ago

@amir-zeldes 's idea with the queries is basically what we are proposing. We could say that "ein" is fully disambiguated as NUM in these contexts where you can tell from the dep tree, and discourage manual annotation as NUM based on someone's interpretation of the sentence (which is fuzzy and often would require pragmatics and more context).
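As a rough illustration of what such a tree-based rule could look like, here is a sketch that flags lemma-"ein" tokens tagged DET when they either have a dependent or are directly followed by "bis" plus a NUM token. These two triggers are only a guess at the contexts discussed in this thread, not an agreed rule, and the function name and toy annotations are hypothetical.

```python
def num_candidates(tokens):
    """Return (id, form) of DET-tagged 'ein' tokens in a hypothetical NUM context."""
    by_id = {t["id"]: t for t in tokens}
    hits = []
    for tok in tokens:
        if tok["lemma"] != "ein" or tok["upos"] != "DET":
            continue
        # Trigger 1 (assumed): something attaches to "ein", e.g. "bis zu".
        has_dependent = any(t["head"] == tok["id"] for t in tokens)
        # Trigger 2 (assumed): surface pattern "ein bis <NUM>".
        nxt = by_id.get(tok["id"] + 1)
        after = by_id.get(tok["id"] + 2)
        in_range = (
            nxt is not None
            and after is not None
            and nxt["form"].lower() == "bis"
            and after["upos"] == "NUM"
        )
        if has_dependent or in_range:
            hits.append((tok["id"], tok["form"]))
    return hits


# Toy annotations following the description of "ein bis zwei Wochen" above.
range_phrase = [
    {"id": 1, "form": "ein", "lemma": "ein", "upos": "DET", "head": 4},
    {"id": 2, "form": "bis", "lemma": "bis", "upos": "ADP", "head": 4},
    {"id": 3, "form": "zwei", "lemma": "zwei", "upos": "NUM", "head": 4},
    {"id": 4, "form": "Wochen", "lemma": "Woche", "upos": "NOUN", "head": 0},
]
plain_article = [
    {"id": 1, "form": "eine", "lemma": "ein", "upos": "DET", "head": 2},
    {"id": 2, "form": "Stunde", "lemma": "Stunde", "upos": "NOUN", "head": 0},
]

print(num_candidates(range_phrase))
print(num_candidates(plain_article))
```

The first call flags "ein" and the second flags nothing, so a plain article stays DET by default. Any real rule set would still need manual inspection of the hits.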

Regarding @nschneid 's comment about the case relation, would the resulting chain of two case relations be ok?

Stormur commented 3 weeks ago

So, I see much skepsis here and I thought I could elaborate a bit further.

I'm also skeptical about NUM as a subclass of DET -- and would entirely disagree with any interpretation that would also result in re-annotating "zwei, drei, vier, ..." as DET.

This is actually implied by the guidelines when it is stated "Note that cardinal numerals are covered by NUM whether they are used as determiners or not". It also makes sense: numerals are very specialised elements conveying just a precise numeric quantity (as opposed to indefinites, say).

So, in the current state of annotation I would not vouch at all for labelling zwei, drei etc. as DET. Simply, DET is the superset of NUM; when we have an element showing ambiguities like ein, or better, a more general sense than a cardinal numeral, then I think it is better to associate it to the more general class. And since in general it is also better to just stick with one POS per lexeme, in my opinion the best choice is to give ein the label DET always. Then we can play with morpholexical features like NumType or NumValue. There are other similar cases like beide, for which probably the non-labelling as NUM is less controversial.

I'm skeptical about "ein bis zwei" being an ellipsis. But even if we analyse it that way, IMO it only illustrates a parallel structure where both numbers are NUM.

Again, if we acknowledge DET ⊃ NUM, there is no problem with their co-ordination. By the way, we can have co-ordination between different POS in certain contexts, such as ADJ with VERB in participial forms. Here the commonality is having a (possible) numeric value.

I would say that this ellipsis is exactly what we would expect from a parallel structure. I do not think we would want to attach the two arguments to each other in a sentence like

ein bis zwei Wochen is really the same. There is even a further "ellipsis" at morphological level, and the lack of a preposition is in line with how temporal arguments are expressed in German.

I would love to get more opinions on how to resolve this in accordance with the general guidelines (@nschneid @amir-zeldes @jnivre @dan-zeman ), ideally so we can ensure HDT passing all validator checks before the upcoming data freeze.

The proposals above are in accordance with general guidelines indeed.


As a general matter, I think of NUM as really a semantic category whose syntactic distribution is a hybrid of DET and NOUN.

But you can say this for modifiers (ADJ/DET) in general for a language like German (and most European languages), where "adjectives are nouny" (cf. Intransitive predication by Leo Stassen). So it does not tell us anything particular about numerals.

Stormur commented 3 weeks ago

I think of NUM as really a semantic category whose syntactic distribution is a hybrid of DET and NOUN. (You might also say that nummod is effectively a subtype of det, in some languages anyway.)

Exactly, I think it's very language specific and we shouldn't base too much on how German or English work.

Sorry for being terse, but this is not the correct way of tackling this problem. We are observing an extremely common pattern at work here, and not acknowledging this while instead resorting to "language specificity" makes each annotation just collapse into an idiosyncratic formalism.

"Language specific" is not the magic answer to everything.

jnivre commented 3 weeks ago

Sorry to be late to the party. Swedish is exactly like German in that the numeral meaning "one" and the singular indefinite article are homographs in writing (and only disambiguated by stress in speech). In the Swedish treebanks, we try to uphold the distinction, but in practice this probably means that the default annotation is article (DET/det) and the numeral annotation (NUM/nummod) is used only when it is clear from the context. I think this is a reasonable compromise.

LeonieWeissweiler commented 3 weeks ago

An attempt to summarise the discussion so far:

To throw another problem in the mix, we would even then be left with two validation errors for s51095 and s68307, which look like this:

[Screenshot: dependency trees for s51095 and s68307]

nschneid commented 3 weeks ago

To add another complexity, if "ein bis zwei" is like "one to two" in English, expressing a range, I am tempted to view it as coordination, though that's not how we've been analyzing it in UD. :)

LeonieWeissweiler commented 3 weeks ago

It is like "one to two" in English (sorry, should have glossed). Analysing it as a coordination would make sense to me and remove the validation error, I think, but if I understand you correctly that would be in violation of general guidelines?

nschneid commented 3 weeks ago

You would have to decide whether "bis" can be tagged as a CCONJ in German. We have not analyzed "to" that way for English in such constructions, though I think it's debatable.

amir-zeldes commented 3 weeks ago

I'm sorry, can someone explain again why this wouldn't be NUM? If it's like "one to two" in English then IMO it should be:

nschneid commented 3 weeks ago

NUM seems very reasonable in principle. I'm just surprised it was never implemented in the first place. Do all German treebanks use a different tag for "ein" vs. "zwei" when they're in coordination? TIGER and so on?

nschneid commented 3 weeks ago

Actually I'm noticing in HDT that some of them have xpos=CARD, while others have ART: https://universal.grew.fr/?custom=67212d1a252d4 Which xpos is correct?

LeonieWeissweiler commented 3 weeks ago

I'm not sure about TIGER, but out of the three other German UD treebanks, two don't have instances, and GSD has the same problem (using DET for the "ein" and NUM for the "zwei").

In checking this, I noticed that we actually already have two instances of "ein bis zwei" in HDT where both are NUM (so it's already inconsistent!), and that even when "ein" is DET, there are two different structures with which this is annotated (one found in @verenablaschke 's screenshot at the top, and one in my recent screenshot).

In total, there are currently 6 instances of "ein" as NUM in HDT. The four not accounted for by "ein bis zwei" are weird artefacts where the "ein" was capitalised in the middle of the sentence, and one occurrence of "Ein ums andere Mal" meaning "time after time", literally "ein upon other time".

But to answer the question, I think it wasn't implemented because "ein bis zwei" and other unambiguous NUM contexts are pretty rare.

Stormur commented 3 weeks ago

What could be a context where ein can be unambiguously annotated as NUM?

Making this choice depend on a co-ordination with a NUM like zwei brings us to contextual annotation, we do not get much information from it.

LeonieWeissweiler commented 3 weeks ago

I'm not sure I understand the question. The above examples "bis zu" meaning "up to" and "ein bis zwei" meaning "one to two" are contexts where "ein" can be unambiguously annotated as NUM.

Could you please elaborate what you mean with "contextual annotation, we do not get much information from it"?

verenablaschke commented 3 weeks ago

@nschneid:

Do all German treebanks use a different tag for "ein" vs. "zwei" when they're in coordination? TIGER and so on?

The original version of HDT (pre conversion to UD) is tagged with the STTS tagset, the guidelines for which make an explicit distinction between "ein" used as a cardinal number (CARD) or an article (ART). They explicitly bring up "ein_CARD bis zwei Millionen" (one to two million) vs. "eine_ART Million" (one million) as an example. TIGER uses lightly modified STTS tags, but the modifications don't concern numerals or determiners (Appendix 1+2). The paper about the HDT->UD conversion doesn't mention numerals, and the paper it cites as basis for the POS tag mapping doesn't discuss them in detail either, but the STTS:UPOS correspondence table on the last page doesn't contain any instances of CARD:DET.

HDT still retains the STTS tags as XPOS, and currently contains 72 words with the lemma "ein" and the XPOS "CARD" that seem to fall into three categories based on a quick look: 1. "ein bis/oder zwei" ("one to/or two") as discussed above, 2. "ein Zoll hoch" ("one inch high") -- I would say: NUM, 3. "ein und derselbe" ("one and the same") -- a MWE where an annotation of DET CCONJ DET seems reasonable.

GSD has two cases of "ein_CARD", one is straightforward ("ein Uhr nachts" = "1 AM") and one has a misspelled word form that has the wrong XPOS tag ("eines" ("a.GEN") misspelled as "eins" ("one"; a word form that can't be used as an article)). The much smaller PUD and LIT don't have any hits (LIT has automatically annotated STTS XPOS tags and no occurrences of "ein"+CARD. PUD has PTB(?) XPOS tags and no hits for lemma="ein"&XPOS="CD".)

My take-away is that using the XPOS tags should make it quite easy to identify most of the unambiguous cases in all of the German UD treebanks (to the extent that they even occur).
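A minimal sketch of that XPOS-based lookup, assuming plain CoNLL-U input with STTS tags in the XPOS column: it lists tokens whose lemma is "ein" and whose XPOS is CARD while the UPOS is still DET. The function name, the `toy-*` sentence IDs and the sample rows are invented for illustration.

```python
def card_tagged_as_det(conllu_text):
    """Return (sent_id, form) for lemma 'ein' with XPOS=CARD but UPOS=DET."""
    hits = []
    sent_id = None
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):].strip()
            continue
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Keep only regular word lines with the ten CoNLL-U columns.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        form, lemma, upos, xpos = cols[1], cols[2], cols[3], cols[4]
        if lemma == "ein" and xpos == "CARD" and upos == "DET":
            hits.append((sent_id, form))
    return hits


def row(*cols):
    """Join column values into one tab-separated CoNLL-U line."""
    return "\t".join(str(c) for c in cols)


# Invented toy data, not taken from any treebank.
SAMPLE = "\n".join([
    "# sent_id = toy-1",
    row(1, "ein", "ein", "DET", "CARD", "_", 4, "det", "_", "_"),
    row(2, "bis", "bis", "ADP", "APPR", "_", 4, "case", "_", "_"),
    row(3, "zwei", "zwei", "NUM", "CARD", "_", 4, "nummod", "_", "_"),
    row(4, "Tagen", "Tag", "NOUN", "NN", "_", 0, "root", "_", "_"),
    "",
    "# sent_id = toy-2",
    row(1, "eine", "ein", "DET", "ART", "_", 2, "det", "_", "_"),
    row(2, "Stunde", "Stunde", "NOUN", "NN", "_", 0, "root", "_", "_"),
])

print(card_tagged_as_det(SAMPLE))
```

Only the "ein" in `toy-1` is reported. Hits like the "ein und derselbe" cases mentioned above would still need a manual decision.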

Actually I'm noticing in HDT that some of them have xpos=CARD, while others have ART: https://universal.grew.fr/?custom=67212d1a252d4 Which xpos is correct?

The cases with XPOS=ART/PIS are interesting. Nearly all of them are in contexts like "ein oder zwei" ("one or two") or "ein, zwei oder drei X" ("one, two, or three X") where we seem to have an actual ellipsis -- the "ein" inflects for the noun following the number sequence as if the noun were in the singular (e.g., "mit einer oder zwei CPUs" "with one/INDEF.SG.F.DAT or two CPU.F.PL.DAT"). The CARD instances aren't inflected: "in ein bis zwei Tagen" ("in one to two day.M.PL.DAT"). The uninflected/non-elliptical version of the first construction sounds good to me ("mit ein oder zwei CPUs"); the elliptical version of the second example ("in einem bis zwei Tagen") sounds awkward to me, but I can't really say why.

Stormur commented 3 weeks ago

I'm not sure I understand the question. The above examples "bis zu" meaning "up to" and "ein bis zwei" meaning "one to two" are contexts where "ein" can be unambiguously annotated as NUM.

Could you please elaborate what you mean with "contextual annotation, we do not get much information from it"?

As far as I have understood, the intention would be to consider ein to be a DET by default, unless some specific contexts trigger its annotation as NUM. Such contexts need to be clearly identifiable.

One of these contexts would be the correlation with a pure numeral like zwei, or the presence of bis zu, which is assumed to modify only ein and not the whole phrase. Now, this makes for a mechanical annotation where the exact label of ein is determined predictably by the context. From another point of view, we are "forcing" ein to be NUM in certain contexts. This first of all creates a weird situation in which 90% of the occurrences of ein are DET, while the rest is NUM, and this is already strange if we consider it to always be the same lexeme (and in my opinion the burden of proof would be to show that we have two different lexemes, not the opposite). But more importantly, we are back-projecting syntax into the POS layer, i.e. we are deciding that NUMs can only be co-ordinated with other NUMs, and so if something is co-ordinated with zwei it needs to be a NUM. This is not informative anymore because we are forcing homogeneous structures, we know what to expect. By the same logic we should then always annotate as ADJ participial forms co-ordinated with adjectives, or use an alternating annotation DET/PRON for some elements like ger. dies just based on whether it appears with a noun or not.

In all such cases we are cancelling information because we are linking together two annotation layers (POS and syntax) which should actually be as orthogonal as possible. We can no longer ask questions like "what is the distribution of determiners as head of their head?" (and ellipsis also gets ignored) or "how often are elements from different word classes co-ordinated, and when is this possible?". Now, admittedly, the case of ein is trickier because, as discussed previously, one class contains the other (DET ⊃ NUM). Another forgotten issue here is that we are ignoring the level of morpholexical features, implicitly making the POS NUM correspond to the expression of a numeric value, which does not seem to be the case when looking at words like beide, Paar or drittens.

I do not know if I managed to express my concern well, but I can also point to section 2.2.2, especially p. 262, of the 2021 introductory paper on UD: "the part-of-speech classification is most useful if it captures regular, prevailing syntactic behavior and does not reflect sentence-specific exceptional behavior. If the POS category were completely predictable from the syntactic function (which is an independent part of UD annotation), then the POS tag would be uninformative".

LeonieWeissweiler commented 3 weeks ago

I think we've arrived at a point where most of us agree, and in light of the imminent data freeze, I'm going to close this issue. While the ART/CARD annotation in the xpos is not perfect and there are false positives and false negatives, we nevertheless see the distinction in the STTS as enough of a justification to introduce this distinction in the German usage of DET and NUM. This has the added benefit of bringing German more in line with other Germanic languages like Swedish. For the kinds of statistics that @Stormur has in mind, as NUM could be considered a subset of DET, one could envision those statistics treating NUM and DET equally.
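For statistics of that kind, one possible approach is to collapse the two tags before counting. The sketch below assumes a flat list of UPOS tags as input; the merged label `DET+NUM` and the function name are made up for illustration.

```python
from collections import Counter

# Hypothetical merged label for analyses that treat NUM as a subset of DET.
MERGED = {"DET": "DET+NUM", "NUM": "DET+NUM"}


def upos_counts(tags, merge_det_num=False):
    """Count UPOS tags, optionally collapsing DET and NUM into one class."""
    if merge_det_num:
        tags = [MERGED.get(tag, tag) for tag in tags]
    return Counter(tags)


tags = ["DET", "NUM", "NOUN", "DET"]
print(upos_counts(tags))
print(upos_counts(tags, merge_det_num=True))
```

With merging switched on, the DET and NUM counts are added together, so a query over determiners gives the same totals whether a given "ein" is tagged DET or NUM.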

We will make the following changes: