Open nschneid opened 1 week ago
The errors in Hebrew are due to things like
# x- so the RTL text doesn't make this unreadable
32 x-ה x-ה DET art PronType=Art 33 det _ Gloss=the|Ref=GEN_19.8
33 x-אֲנָשִׁ֤ים x-אישׁ NOUN subs Gender=Masc|Number=Plur 38 obl _ Gloss=man|Ref=GEN_19.8
34-35 x-הָאֵל֙ x-_ _ _ _ _ _ _ _
34 x-הָ x-ה DET art PronType=Art 35 det _ Gloss=the|Ref=GEN_19.8
35 x-אֵל֙ x-אל PRON prde Number=Plur|PronType=Dem 33 det _ Gloss=these|Ref=GEN_19.8
where demonstrative pronouns have their own determiners. (I'm open to other means of annotating this.)
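The configuration above can be reproduced mechanically. What follows is a minimal sketch of a `leaf-det-clf`-style check, not the validator's actual code. The function name `det_nodes_with_children`, the tuple layout, and the `allowed` parameter are all hypothetical. The default exception set contains only `fixed` because that is the one relation this thread reports as accepted, and the real validator's list may differ. Comparing only the universal part of a relation (before any `:subtype`) is also an assumption. Forms are replaced by the glosses from the `Gloss=` field to keep the RTL text out of the code.

```python
def det_nodes_with_children(nodes, allowed=frozenset({"fixed"})):
    """Report children of det/clf nodes whose relation is not in `allowed`.

    nodes: list of (id, form, head, deprel) tuples for one sentence.
    allowed: universal relations tolerated under det/clf (an assumption;
             the real validator's exception list is not reproduced here).
    """
    by_id = {node[0]: node for node in nodes}
    hits = []
    for child_id, child_form, child_head, child_deprel in nodes:
        parent = by_id.get(child_head)
        if parent is None:
            # Head is the root or lies outside the excerpt.
            continue
        parent_id, parent_form, _, parent_deprel = parent
        parent_is_det = parent_deprel.split(":")[0] in ("det", "clf")
        child_is_allowed = child_deprel.split(":")[0] in allowed
        if parent_is_det and not child_is_allowed:
            hits.append(
                f"{parent_id}:{parent_form}:{parent_deprel} --> "
                f"{child_id}:{child_form}:{child_deprel}"
            )
    return hits


# Nodes 32-35 of the Hebrew excerpt (GEN_19.8), with glosses as forms.
hebrew = [
    (32, "the", 33, "det"),
    (33, "man", 38, "obl"),
    (34, "the", 35, "det"),
    (35, "these", 33, "det"),
]

for hit in det_nodes_with_children(hebrew):
    print(hit)  # 35:these:det --> 34:the:det
```

Passing a wider `allowed` set, for example one that includes `det`, makes the same excerpt pass, which is one way to try out a proposed exception before changing any annotation.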
@mr-martian this is also the analysis used in the modern Hebrew TBs, so I would be inclined to accept and keep it (it's also parallel to how adjectival modification works in Hebrew)
If I were doing Hebrew from scratch, the one alternative I'd consider is treating ה as an inflectional prefix rather than a syntactic word.
I would vote against that TBH, it's not how other languages with repeating articles do it either (e.g. Greek) and it complicates lemmatization, type counts, and a bunch of other things.
I have one remaining error:
[(in gd_arcosg-ud-train.conllu) Line 55940 Sent p01_033h Node 79]: [L3 Syntax leaf-det-clf] 'det' not expected to have children (79:a:det --> 81:h-uile:compound)
The treebank uses `compound` for reduplication: https://universaldependencies.org/gd/dep/compound.html
The offending tree has someone emphasising 'every' by saying a h-uile h-uile. Is there maybe a better way I should be doing this or could it be an exception?
Repetition for emphasis: would `flat` be a good option instead of `compound`? Cf. https://universaldependencies.org/u/dep/flat.html#iconic-sequences (though I can't speak to how languages are dealing with reduplication in general).
The validator currently allows `fixed`, but not `flat`, it seems.
This invalidated both HDT and GSD for German, mostly because of vor allem (mainly) and unter anderem (among others). For both, the first word is an `ADP` and the second is a `DET`, with the `ADP` depending on the `DET` via the `case` relation.
How should we handle this better?
unter anderem is sometimes treated as a `fixed` expression. Here is a case triggering the error:
I assume this means "among other teachers"—is there a reason not to analyze it as "among [other teachers]", with unter attaching to Lehrer?
No, for the German case it's not "among other teachers", notice "other" is dative but "teacher" is not - it's "among others, teachers". I think the mistake is the deprel det - this is not a determiner but an oblique modifier, just like English "among others".
What about `flat:redup` to mark repetition for emphasis?
Here are two examples in one sentence from Roman tragedies in UD Latin-CIRCSE:
For spoken data, we need three relations to be added to the validator:
- `discourse`, which is very common between two determiners in false starts: "a, uh, a gap", "my, uh, our friend"
- `parataxis` for cases such as "a, I don't know how to call that, a kiosk, …": here we have a `reparandum` link between the two "a"s and we would like to attach the parenthesis to the first "a". More exactly, we use `parataxis:parenth` in our spoken French treebanks.
- `dep` for false starts such as "the last, the last day": here "the last" forms a phrase the head of which is missing and we decided to have dep(the, last). I am not against another solution, as long as "the last" is still a phrase.
In Latvian, we have several expressions considered as compound pronouns in Latvian traditional grammar which consist of one particle and one pronoun. For example, kaut kāds where kaut is a particle and kāds is a pronoun (this expression roughly means 'some kind of'). Currently, we annotate the particle as `discourse`, which is a dependent of the pronoun, and the pronoun occasionally becomes `det` if the expression describes a noun. This leads to a validation error.
The particles in these expressions usually are kaut, diez, diezin, nez, nezin, and they all have very fuzzy, hard to pin down semantics so we feel uncomfortable annotating them as adverbs.
We would like to annotate these expressions as `compound` (instead of `fixed`) because the pronoun is the second element in the phrase and we feel that it is the head of the phrase: the pronoun inflects together with the noun and bears most of the semantic meaning of the expression.
Would you please consider allowing `compound` in this construction, or is there any other option appropriate here?
@dan-zeman What about relaxing the error to a warning while we figure out the contours of the rule?
I think that this new rule is fine, even if, while correcting, my colleagues and I have encountered a couple of cases which really do not look reducible to a trivial correction like all the others.
`flat:redup` in Latin treebanks. One example is quot quot from quot: while the latter means 'as many as', the reduplication has a distributive sense as in 'for each possible one...' (this expression is sometimes even univerbated). I think annotating them separately, each depending on the head, is not the right way to deal with them: here we do not have two or more different terms, but really the same one "cloning" itself. On the other hand, `flat` is really the closest relation we have to `fixed`, which would cause no problem, but is not a correct choice (well, in my opinion it is never the correct choice).
To summarise the above discussion, my two proposals are to deactivate this validation rule if:
- `det` is a `flat` relation
- `Person`
We have something similar to the case in 1. in Coptic where a word is repeated for distributive meaning:
Etc. 1-2 also work fine in modern Hebrew BTW, and 3. would work in the plural. What we did in UD Coptic was interpret them as nominal modifiers without a preposition (i.e. "one one" is the same as "one by one" with the word "by" suppressed). We then used the `nmod:unmarked` relation, which is a subtype of `nmod` used without a `case` marker.
I notice that the leaf-det-clf rule introduced in UniversalDependencies/tools@1e4debd and then revised in UniversalDependencies/tools@759c5ae has invalidated quite a lot (a majority?) of treebanks.
Is further revision necessary? For example, EWT is still experiencing some errors that look like they should be valid:
- `det` + `nmod`, e.g. "at least some reports" (`det(reports, some)`, `nmod(some, least)`). "at least" is admittedly ADV-like, so another option is to make it `ExtPos=ADV` and `advmod`.
- `det` licensing an `advcl`, as in these results. The guidelines on sufficiency and excess for "so" and similar say the `advcl` should attach to the adjective or adverb, not the noun, in a case like "sufficient flour". In "such a high price that nobody could afford it", I suppose "such" should have an `advcl` dependent?
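To get an overview of which relations actually occur under `det` in a treebank before settling on exceptions, one could tally them. This is a rough sketch only: the helper names `read_sentences` and `tally_det_children` are hypothetical, the parser handles just the ten tab-separated CoNLL-U columns and skips multiword-token ranges and empty nodes, and the sample sentence is built from the two relations quoted above, with the attachment of "at" to "least" as `case` being an assumption made for illustration.

```python
from collections import Counter


def read_sentences(conllu_text):
    """Yield one list of (id, head, deprel) per sentence.

    Comment lines, multiword-token ranges (e.g. 34-35) and empty nodes
    (e.g. 5.1) are skipped; only basic word lines are kept.
    """
    sent = []
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sent:
                yield sent
                sent = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit() or not cols[6].isdigit():
            continue
        sent.append((int(cols[0]), int(cols[6]), cols[7]))
    if sent:
        yield sent


def tally_det_children(conllu_text):
    """Count the relations of all nodes whose head is attached as det."""
    counts = Counter()
    for sent in read_sentences(conllu_text):
        deprel_of = {node_id: deprel for node_id, _, deprel in sent}
        for _, head, deprel in sent:
            if deprel_of.get(head, "").split(":")[0] == "det":
                counts[deprel] += 1
    return counts


# "at least some reports": det(reports, some) and nmod(some, least) are
# quoted in the thread; case(least, at) is assumed for illustration.
rows = [
    ("1", "at", "2", "case"),
    ("2", "least", "3", "nmod"),
    ("3", "some", "4", "det"),
    ("4", "reports", "0", "root"),
]
sample = "\n".join(
    "\t".join([i, form, "_", "_", "_", "_", head, deprel, "_", "_"])
    for i, form, head, deprel in rows
)

print(tally_det_children(sample))  # Counter({'nmod': 1})
```

Run over a whole `.conllu` file, the resulting counts would show how often each candidate exception (`nmod`, `advcl`, `flat`, `compound`, and so on) is really at stake.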