UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

New validator rule: leaf-det-clf #1059

Open nschneid opened 1 week ago

nschneid commented 1 week ago

I notice that the leaf-det-clf rule introduced in UniversalDependencies/tools@1e4debd and then revised in UniversalDependencies/tools@759c5ae has invalidated quite a lot (a majority?) of treebanks.

Is further revision necessary? For example, EWT is still experiencing some errors that look like they should be valid:

mr-martian commented 1 week ago

The errors in Hebrew are due to things like

# x- so the RTL text doesn't make this unreadable
32  x-ה x-ה DET art PronType=Art    33  det _   Gloss=the|Ref=GEN_19.8
33  x-אֲנָשִׁ֤ים    x-אישׁ  NOUN    subs    Gender=Masc|Number=Plur 38  obl _   Gloss=man|Ref=GEN_19.8
34-35   x-הָאֵל֙    x-_ _   _   _   _   _   _   _
34  x-הָ    x-ה DET art PronType=Art    35  det _   Gloss=the|Ref=GEN_19.8
35  x-אֵל֙  x-אל    PRON    prde    Number=Plur|PronType=Dem    33  det _   Gloss=these|Ref=GEN_19.8

where demonstrative pronouns have their own determiners. (I'm open to other means of annotating this.)

amir-zeldes commented 1 week ago

@mr-martian this is also the analysis used in the modern Hebrew TBs, so I would be inclined to accept and keep it (it's also parallel to how adjectival modification works in Hebrew)

mr-martian commented 1 week ago

If I were doing Hebrew from scratch, the one alternative I'd consider is treating ה as an inflectional prefix rather than a syntactic word.

amir-zeldes commented 1 week ago

I would vote against that TBH, it's not how other languages with repeating articles do it either (e.g. Greek) and it complicates lemmatization, type counts, and a bunch of other things.

colinbatchelor commented 1 week ago

I have one remaining error: [(in gd_arcosg-ud-train.conllu) Line 55940 Sent p01_033h Node 79]: [L3 Syntax leaf-det-clf] 'det' not expected to have children (79:a:det --> 81:h-uile:compound)

The offending tree has someone emphasising 'every' by saying a h-uile h-uile. Is there maybe a better way I should be doing this or could it be an exception?

nschneid commented 6 days ago

Repetition for emphasis: would flat be a good option instead of compound? Cf. https://universaldependencies.org/u/dep/flat.html#iconic-sequences (though I can't speak to how languages are dealing with reduplication in general).

The validator currently allows fixed, but not flat, it seems.
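
For readers following along, here is a rough sketch of the kind of check being discussed. It is not the code in UniversalDependencies/tools. The function names, the tuple layout, and the allowed set (containing only fixed, the one relation this thread says is accepted) are assumptions for illustration. The node IDs and forms come from the Scottish Gaelic error message above; the head of node 79 is a placeholder.

```python
# Hypothetical sketch of a "det must be a leaf" check; NOT the real validator code.
# Assumption: only 'fixed' is allowed under det, as suggested in this thread.
ALLOWED_UNDER_DET = {"fixed"}


def universal(deprel):
    """Strip a language-specific subtype, e.g. 'flat:redup' -> 'flat'."""
    return deprel.split(":")[0]


def det_leaf_violations(nodes, allowed=ALLOWED_UNDER_DET):
    """nodes: list of (id, form, head, deprel) tuples for one sentence.

    Returns one message per child of a det node whose relation is not allowed.
    """
    by_id = {node[0]: node for node in nodes}
    messages = []
    for cid, cform, head, cdeprel in nodes:
        parent = by_id.get(head)
        if parent is None:  # root, or head outside this fragment
            continue
        pid, pform, _, pdeprel = parent
        if universal(pdeprel) == "det" and universal(cdeprel) not in allowed:
            messages.append(f"{pid}:{pform}:{pdeprel} --> {cid}:{cform}:{cdeprel}")
    return messages


# Fragment based on the error message above; head 82 is a placeholder.
fragment = [(79, "a", 82, "det"), (81, "h-uile", 79, "compound")]
print(det_leaf_violations(fragment))
# prints ['79:a:det --> 81:h-uile:compound']
```

Under these assumptions, switching the child relation to flat would still be reported, while fixed would pass.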

LeonieWeissweiler commented 6 days ago

This invalidated both HDT and GSD for German, mostly because of vor allem (mainly) and unter anderem (among others). For both, the first word is an ADP and the second is a DET that depends on it with the case relation.

How should we handle this better?

nschneid commented 6 days ago

unter anderem is sometimes treated as a fixed expression. Here is a case triggering the error:

[image]

I assume this means "among other teachers"—is there a reason not to analyze it as "among [other teachers]", with unter attaching to Lehrer?

amir-zeldes commented 6 days ago

No, for the German case it's not "among other teachers", notice "other" is dative but "teacher" is not - it's "among others, teachers". I think the mistake is the deprel det - this is not a determiner but an oblique modifier, just like English "among others".

FedeIure commented 6 days ago

> Repetition for emphasis: would flat be a good option instead of compound? Cf. https://universaldependencies.org/u/dep/flat.html#iconic-sequences (though I can't speak to how languages are dealing with reduplication in general).
>
> The validator currently allows fixed, but not flat, it seems.

What about flat:redup to mark repetition for emphasis?

Here are two examples in one sentence from Roman tragedies in UD Latin-CIRCSE:

[image: flat_redup_Latin_CIRCSE]

sylvainkahane commented 6 days ago

For spoken data, we need three relations to be added to the validator:

lrituma commented 2 days ago

In Latvian, we have several expressions, considered compound pronouns in traditional Latvian grammar, which consist of one particle and one pronoun. For example, kaut kāds, where kaut is a particle and kāds is a pronoun (this expression roughly means 'some kind of'). Currently, we annotate the particle as discourse, dependent on the pronoun, and the pronoun occasionally becomes det if the expression describes a noun. This leads to a validation error.

The particles in these expressions usually are kaut, diez, diezin, nez, nezin, and they all have very fuzzy, hard to pin down semantics so we feel uncomfortable annotating them as adverbs.

We would like to annotate these expressions as compound (instead of fixed) because the pronoun is the second element in the phrase, and we feel that it is the head of the phrase because the pronoun inflects together with the noun and bears most of the semantic meaning of the expression.

Would you please consider allowing compound in this construction or is there any other option appropriate here?

nschneid commented 1 day ago

@dan-zeman What about relaxing the error to a warning while we figure out the contours of the rule?

Stormur commented 45 minutes ago

I think that this new rule is fine, even though, while correcting, my colleagues and I have encountered a couple of cases that, unlike all the others, really do not look reducible to a trivial correction.

  1. The already mentioned reduplication, which is treated through flat:redup in Latin treebanks. One example is quot quot from quot: while the latter means 'as many as', the reduplication has a distributive sense, as in 'for each possible one...' (this expression is sometimes even univerbated). I think that annotating them separately, each depending on the head, is not the right way to deal with them: here we do not have two or more different terms, but really the same one "cloning" itself. On the other hand, flat is really the closest relation we have to fixed, which would cause no problem but is not a correct choice here (well, in my opinion it is never the correct choice).
    • Problem: horizontal relation
  2. The phrase nostra qui remansissemus caede 'the murder of us who are left (behind)', but more literally 'our who are left murder', since nostra is the inflected possessive determiner for the 1st person plural. What happens here is that the possessive adds a nominal person, as it were, and this person is another referent beyond the noun caede 'murder' in this phrase; as such, the relative can target it (or at least, Cicero allows himself to do so). We could not really justify an analysis where we shift the relative under the head noun, since the murder is not one of its arguments.
    • Problem: the dependent of the determiner cannot be traced back to the referent of its head

To summarise the above discussion, my two proposals are to deactivate this validation rule if:

  1. the child of det is a flat relation
  2. the head element has the feature Person
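
The two proposed exemptions could be sketched roughly as follows. This is only an illustration, not validator code: the function names are hypothetical, "head element" is read here as the det node that governs the offending child, and the feature strings in the usage lines are made-up examples.

```python
# Hypothetical sketch of the two proposed exemptions; NOT the real validator code.

def parse_feats(feats):
    """Turn a CoNLL-U FEATS string such as 'Person=1|Poss=Yes' into a dict."""
    if feats in ("_", ""):
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


def rule_applies(det_feats, child_deprel):
    """Return False when either proposed exemption switches the rule off.

    det_feats:    FEATS string of the det node (assumed to be the "head element").
    child_deprel: relation of the child attached to that det node.
    """
    # Proposal 1: the child of det is a flat relation (subtypes included).
    if child_deprel.split(":")[0] == "flat":
        return False
    # Proposal 2: the det node carries the feature Person.
    if "Person" in parse_feats(det_feats):
        return False
    return True


# Illustrative feature strings, not taken from any treebank.
print(rule_applies("_", "flat:redup"))                # False: exemption 1
print(rule_applies("Person=1|Poss=Yes", "acl:relcl"))  # False: exemption 2
print(rule_applies("PronType=Art", "compound"))        # True: rule still applies
```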

amir-zeldes commented 32 minutes ago

We have something similar to the case in 1. in Coptic where a word is repeated for distributive meaning:

  1. one one = "one by one"
  2. two two = "two by two, in pairs"
  3. color color = "color for color, every color"

Etc. 1-2 also work fine in modern Hebrew BTW, and 3. would work in the plural. What we did in UD Coptic was interpret them as nominal modifiers without a preposition (i.e. "one one" is the same as "one by one" with the word "by" suppressed). We then used the nmod:unmarked relation, which is a subtype of nmod used without a case marker.