Closed ioan2 closed 5 years ago
Hi Johannes, punctuation should never be attached via non-projective dependencies (http://universaldependencies.org/u/dep/punct.html), with the possible exception of non-projectivities that are caused by other nodes, where the punctuation symbol is just taken along. The second comma in de/dev-s23 is clearly a bug; it should be attached to bekomme.
Hi, thanks for your confirmation. I thought so, but did not dare to ask. In fact, at least in UD_French and UD_German, most non-projective sentences are caused by the attachment of punctuation symbols. I will prepare a list.
Thanks @martinpopel. It sounds like we should take a more general approach to this and try to fix it once and for all for version 2.1.
I've improved the ud.MarkBugs tests, so now, in addition to non-projectively attached PUNCT nodes, it also reports PUNCT nodes that cause non-projectivity of another edge. Both cases are forbidden by the guidelines.
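To make the two cases concrete, here is a minimal self-contained sketch in plain Python. It is not the ud.MarkBugs code. It assumes a tree is given as a list of head ids (1-based, 0 for the artificial root) plus a parallel list of UPOS tags, and it treats an edge as non-projective when some node between the dependent and its head is not dominated by that head. The function names and the label "punct-in-gap" are made up for this illustration, and the toy tree is invented rather than taken from any treebank.

```python
def ancestors(heads, node):
    """Proper ancestors of `node`; ids are 1-based and 0 is the artificial root."""
    found = set()
    while node != 0:
        node = heads[node - 1]
        found.add(node)
    return found


def gap_nodes(heads, dep):
    """Nodes strictly between `dep` and its head that the head does not dominate."""
    head = heads[dep - 1]
    lo, hi = sorted((head, dep))
    return [k for k in range(lo + 1, hi) if head not in ancestors(heads, k)]


def punct_problems(heads, upos):
    """Return sorted (node_id, label) pairs for the two kinds of punctuation error."""
    problems = set()
    for dep in range(1, len(heads) + 1):
        gap = gap_nodes(heads, dep)
        if gap and upos[dep - 1] == "PUNCT":
            # Case 1: the PUNCT node's own edge is non-projective.
            problems.add((dep, "punct-nonproj"))
        for k in gap:
            if upos[k - 1] == "PUNCT":
                # Case 2: a PUNCT node sits in the gap of another node's edge.
                problems.add((k, "punct-in-gap"))
    return sorted(problems)


# Toy 7-token tree, invented for illustration (hypothetical heads).
upos = ["VERB", "PUNCT", "VERB", "PUNCT", "ADV", "VERB", "PUNCT"]
buggy = [0, 3, 1, 1, 6, 3, 1]  # node 4 (a comma) hangs on node 1
fixed = [0, 3, 1, 6, 6, 3, 1]  # node 4 re-attached to node 6

print(punct_problems(buggy, upos))  # [(4, 'punct-in-gap')]
print(punct_problems(fixed, upos))  # []
```

In the buggy variant the comma's own edge is projective, but the comma sits inside the span of the edge from node 3 to node 6 without being dominated by node 3, which is the second kind of error described above.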
UDv2.0 contains many treebanks with a high number of such errors:
2121 ca
1839 pt
1682 eu
1443 es_ancora
1362 ar_nyuad
1347 grc
827 nl_lassysmall
805 en_lines
745 fr
639 de
605 cs
etc.
A visualization of all such errors for all treebanks is available here.
Why bother about punctuation attachment?
I've implemented a Udapi block, ud.FixPunct, which fixes the attachment of punctuation automatically (strictly following the guidelines, but where several attachments are possible it respects the original annotation). It fixes most of the non-projective punctuation; only in rare cases does it fail, and these cases are usually worth manual inspection anyway. Feel free to use it for your treebank (ask me if you need help or want to see a visualization of the changes).
Thanks, @martinpopel. I checked the UD_Russian data and found a chain of "list" elements (X, Y, Z, W...) in which the brackets are somehow displaced: X (Y, Z, W...). So we have a non-projectivity which cannot be considered a "norm" but still cannot be "forbidden". I think similar cases can pop up in conj groups, e.g. apples (and oranges, and bananas), as well as in MWEs, e.g. and so on (and so forth). Olga
apples (and oranges, and bananas)
The guidelines say the brackets "should be attached to the same word unless that would create non-projectivity" and this phrase is a nice example why the condition is needed. Here the phrase enclosed in the brackets does not have a single head, so the obvious solution (at least as implemented in ud.FixPunct) is to attach the left bracket to oranges and the right bracket to bananas.
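A small sketch may help to see why the condition matters here. The code below is plain Python, not ud.FixPunct. It assumes the tree is a list of 1-based head ids (0 for the root), that oranges and bananas are both attached to apples as conj, that each "and" is attached to the following conjunct as cc, and that the comma is attached to bananas. These attachments and all function names are assumptions made for the illustration. It compares three ways of attaching the brackets and lists the tokens whose incoming edge becomes non-projective.

```python
def ancestors(heads, node):
    """Proper ancestors of `node`; ids are 1-based and 0 is the artificial root."""
    found = set()
    while node != 0:
        node = heads[node - 1]
        found.add(node)
    return found


def is_nonprojective(heads, dep):
    """True if some node between `dep` and its head is not dominated by the head."""
    head = heads[dep - 1]
    lo, hi = sorted((head, dep))
    return any(head not in ancestors(heads, k) for k in range(lo + 1, hi))


def nonprojective_tokens(tokens, heads):
    """Forms of the tokens whose edge to their head is non-projective."""
    return [tokens[d - 1] for d in range(1, len(heads) + 1)
            if is_nonprojective(heads, d)]


#          1         2    3      4          5    6      7          8
tokens = ["apples", "(", "and", "oranges", ",", "and", "bananas", ")"]


def with_brackets(left_head, right_head):
    """Heads for the phrase, with only the two bracket attachments varying."""
    return [0, left_head, 4, 1, 7, 7, 1, right_head]


print(nonprojective_tokens(tokens, with_brackets(4, 4)))  # [')']  both on oranges
print(nonprojective_tokens(tokens, with_brackets(7, 7)))  # ['(']  both on bananas
print(nonprojective_tokens(tokens, with_brackets(4, 7)))  # []     split attachment
```

Under these assumptions, attaching both brackets to the same word always leaves one of them non-projective, while the split attachment (left bracket to oranges, right bracket to bananas) keeps the tree projective.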
so on (and so forth)
I guess here both brackets should be attached to forth, which should be attached to on as conj (rather than fixed).
There are rare cases where you need to break one or another UD rule. For example, if there are brackets enclosing a phrase with no content words, we either need to attach the brackets to a function word (which is not listed among the four exceptions) or we need to attach the paired brackets to a word outside of those brackets.
a chain of "list" elements (X, Y, Z, W...) in which the brackets are somehow displaced: X (Y, Z, W...)
This is the same as "apples (and oranges, and bananas)". I am not sure where the boundary between conj and list lies (the guidelines say list should not be overused; I think asyndetic coordination should still be annotated as coordination with conj), but this is irrelevant to the question of punctuation attachment.
In UD_Russian, I see sentence dev-s113 with (simplified) семья (папа, мама, брат), i.e. "family (dad, mom, brother)", where мама and брат are attached as list to семья, but I think this is an error (which would probably have remained unnoticed without the punct-nonproj test): they should be attached to папа (which is correctly attached to семья as appos).
Yes, generally the script helps us reveal lots of errors. Thanks once more. OK, we will follow the "right bracket to the last element" rule. Olga
Thanks, Martin. Very useful! We should really make a systematic effort to make use of all this information (once we have recovered from workshops and shared tasks).
Joakim
We can annotate the example X (Y, Z, W ...) while keeping projectivity, by attaching the brackets to the head of the expression they delimit. For this purpose, we consider (Y, Z, W ...) to be an enumeration that is part of another enumeration consisting of two elements: X and (Y, Z, W ...). In this view, the brackets depend on Y and only Y depends on X. Such an annotation is consistent with the interpretation of the brackets: Y, Z and W are not at the same level as X in the enumeration. There is a more problematic case. Sometimes quotation marks can cause non-projectivity, because a word outside the quotation marks can depend on a word inside them that is not the head of the quoted expression.
Example: UD_French-GSD, fr-ud-train05807: les rebelles avaient été « informés par les autorités françaises de la présence de ces hommes » à Benghazi (the rebels had been "informed by the French authorities of the presence of these men" in Benghazi)
"Benghazi" depends on "presence", which is not the head of the expression in quotation marks. The head is "informés", and it makes sense to attach the quotation marks to "informés", which causes non-projectivity. Of course, it is possible to tinker with the annotation: attach the first mark to "informés" and the second mark to "presence" or "men", for example. I would prefer to add an exception to the guidelines, for quotation marks only, that allows non-projectivity for them.
Hello,
When working with version 2.0 of UD_French and UD_German, I noticed that the attachments of some punctuation symbols (but not only those) look strange and make the dependency tree non-projective. For instance, in de_ud_dev.conll, sentence 23 ("Ich dachte mir , dieser Shop ist für seine Schnäppchen bekannt , also bekomme ich gute Markenware zu dem günstigen Preis .") the comma after "bekannt" is attached to "dachte". Similarly for sentence 126 (which is ungrammatical and has a non-projective tree).
I found similar cases in UD_French and UD_Persian (where the opening and closing quotes are both attached to a preceding adposition, sentence 289).
What is the motivation for the attachment that makes trees non-projective?
Best Regards Johannes