UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Non-projective punctuation #440

Closed ioan2 closed 5 years ago

ioan2 commented 7 years ago

Hello,

When working with version 2.0 of UD_French and UD_German, I noticed that the attachments of some punctuation symbols (and not only punctuation) look strange and make the dependency tree non-projective. For instance, in de_ud_dev.conll, sentence 23 ("Ich dachte mir , dieser Shop ist für seine Schnäppchen bekannt , also bekomme ich gute Markenware zu dem günstigen Preis .") has the comma after "bekannt" attached to "dachte". Similarly, sentence 126 (which is ungrammatical) has a non-projective tree.

I found similar cases in UD_French and in UD_Persian, where the opening and closing quotes are both attached to a preceding adposition (sentence 289).

What is the motivation for attachments that make trees non-projective?

Best regards, Johannes

dan-zeman commented 7 years ago

Hi Johannes, punctuation should never be attached via non-projective dependencies (http://universaldependencies.org/u/dep/punct.html). The only possible exception is a non-projectivity that is caused by other nodes, where the punctuation symbol is just taken along. The second comma in de/dev-s23 is clearly a bug; it should be attached to bekomme.

ioan2 commented 7 years ago

Hi, thanks for the confirmation. I thought so, but did not dare to ask. In fact, at least in UD_French and UD_German, most non-projective sentences are caused by the attachment of punctuation symbols. I will prepare a list.

martinpopel commented 7 years ago

I've generated the list for UD_French and UD_German, simply by running `cat *.conllu | udapy -HM ud.MarkBugs tests=punct-nonproj > punct-nonproj.html`. See Udapi. Of course, there are many other treebanks with this kind of error.

jnivre commented 7 years ago

Thanks, @martinpopel. It sounds like we should take a more general approach to this and try to fix it once and for all for version 2.1.

martinpopel commented 7 years ago

I've improved the ud.MarkBugs tests, so now, in addition to non-projectively attached PUNCT nodes, it also reports PUNCT nodes which cause a non-projectivity of another edge. Both of these cases are forbidden according to the guidelines.
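The two cases can be illustrated with a small pure-Python sketch. This is not the ud.MarkBugs implementation; the function names (`parse_conllu_sentence`, `gap_nodes`, `punct_nonproj_report`) and the toy sentence are made up for illustration. It assumes that an edge is non-projective when some node lying between the head and the dependent is not a descendant of the head.

```python
from typing import Dict, List, Tuple

Token = Tuple[int, str, str, int]  # (id, form, upos, head); head 0 is the root


def parse_conllu_sentence(block: str) -> List[Token]:
    """Read the basic tokens of one CoNLL-U sentence."""
    tokens = []
    for line in block.splitlines():
        cols = line.split("\t")
        # Skip comments, blank lines, multiword ranges (1-2) and empty nodes (1.1).
        if len(cols) < 7 or not cols[0].isdigit():
            continue
        tokens.append((int(cols[0]), cols[1], cols[3], int(cols[6])))
    return tokens


def is_descendant(heads: Dict[int, int], node: int, ancestor: int) -> bool:
    """True if `ancestor` is reached by following head links upwards from `node`."""
    steps = 0
    while node != 0 and steps <= len(heads):  # the step limit guards against cycles
        node = heads[node]
        if node == ancestor:
            return True
        steps += 1
    return False


def gap_nodes(heads: Dict[int, int], dep: int) -> List[int]:
    """Nodes between `dep` and its head that the head does not dominate."""
    head = heads[dep]
    lo, hi = sorted((head, dep))
    return [n for n in range(lo + 1, hi) if not is_descendant(heads, n, head)]


def punct_nonproj_report(tokens: List[Token]) -> Tuple[List[int], List[int]]:
    """Return (PUNCT ids attached non-projectively, PUNCT ids in another edge's gap)."""
    heads = {i: h for i, _, _, h in tokens}
    upos = {i: u for i, _, u, _ in tokens}
    attached = [i for i in heads if upos[i] == "PUNCT" and gap_nodes(heads, i)]
    in_gap = sorted({n for d in heads for n in gap_nodes(heads, d) if upos[n] == "PUNCT"})
    return attached, in_gap


def to_conllu(rows: List[Token]) -> str:
    """Build a minimal 10-column CoNLL-U block from (id, form, upos, head) rows."""
    return "\n".join(
        "\t".join([str(i), form, "_", upos, "_", "_", str(head), "_", "_", "_"])
        for i, form, upos, head in rows
    )


# Toy sentence (hypothetical annotation): the second comma (6) hangs on "thought".
rows = [
    (1, "I", "PRON", 2), (2, "thought", "VERB", 0), (3, ",", "PUNCT", 5),
    (4, "it", "PRON", 5), (5, "works", "VERB", 2), (6, ",", "PUNCT", 2),
    (7, "so", "ADV", 9), (8, "I", "PRON", 9), (9, "buy", "VERB", 5),
    (10, ".", "PUNCT", 2),
]
print(punct_nonproj_report(parse_conllu_sentence(to_conllu(rows))))  # ([], [6])
```

In this toy tree the comma's own edge is projective, but the comma sits in the gap of the edge from "works" to "buy", which is the second kind of error. Re-attaching the comma to "buy" (token 9) makes both lists empty.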

UDv2.0 contains many treebanks with a high number of such errors:

 2121 ca
 1839 pt
 1682 eu
 1443 es_ancora
 1362 ar_nyuad
 1347 grc
  827 nl_lassysmall
  805 en_lines
  745 fr
  639 de
  605 cs
 etc.

A visualization of all such errors for all treebanks is available here.

Why bother about punctuation attachment?

I've implemented a Udapi block, ud.FixPunct, which fixes the attachment of punctuation automatically (strictly following the guidelines, but where several attachments are possible it respects the original annotation). It fixes most of the non-projective punctuation. Only in rare cases does it not succeed, and those cases are usually worth manual inspection anyway. Feel free to use it for your treebank (ask me if you need help or want to see a visualization of the changes).

olesar commented 7 years ago

Thanks, @martinpopel. I checked the UD_Russian data and found a chain of "list" elements (X, Y, Z, W...) in which the brackets are somehow displaced: X (Y, Z, W...). So we have a non-projectivity which cannot be considered a "norm" but still cannot be "forbidden". I think similar cases can pop up in conj groups, e.g. apples (and oranges, and bananas), as well as in MWEs, e.g. and so on (and so forth). Olga

martinpopel commented 7 years ago

> apples (and oranges, and bananas)

The guidelines say the brackets "should be attached to the same word unless that would create non-projectivity" and this phrase is a nice example why the condition is needed. Here the phrase enclosed in the brackets does not have a single head, so the obvious solution (at least as implemented in ud.FixPunct) is to attach the left bracket to oranges and the right bracket to bananas.
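The attachment described here can be sketched in a few lines of Python. This is only an illustration of the idea, not the ud.FixPunct code; `bracket_heads` and the toy head assignments are hypothetical. The sketch treats every word inside the brackets whose head lies outside them as a "span root": with one span root both brackets attach to it, and with several the left bracket goes to the first and the right bracket to the last.

```python
from typing import Dict, Tuple


def bracket_heads(heads: Dict[int, int], left: int, right: int) -> Tuple[int, int]:
    """Pick heads for a bracket pair at positions `left` and `right`.

    `heads` maps each non-bracket token id to its head id (0 is the root).
    """
    inside = range(left + 1, right)
    # A span root is a word inside the brackets whose head is outside them.
    span_roots = [n for n in inside if not (left < heads[n] < right)]
    if not span_roots:
        raise ValueError("nothing between the brackets")
    return span_roots[0], span_roots[-1]


# Toy annotation (assumed): oranges and bananas are both conj of apples.
forms = {1: "apples", 2: "(", 3: "and", 4: "oranges",
         5: ",", 6: "and", 7: "bananas", 8: ")"}
heads = {1: 0, 3: 4, 4: 1, 5: 7, 6: 7, 7: 1}

left_head, right_head = bracket_heads(heads, left=2, right=8)
print(forms[left_head], forms[right_head])  # oranges bananas
```

The sketch does not verify projectivity itself, and it does not cover the case of brackets that enclose no content word.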

> so on (and so forth)

I guess here both brackets should be attached to forth, which should be attached to on as conj (rather than fixed).

There are rare cases where you need to break one or another UD rule. For example, if there are brackets enclosing a phrase with no content words, we either need to attach the brackets to a function word (which is not listed among the four exceptions) or we need to attach the paired brackets to a word outside of those brackets.

> a chain of "list" elements (X, Y, Z, W...) in which the brackets are somehow displaced: X (Y, Z, W...)

This is the same as "apples (and oranges, and bananas)". I am not sure where the boundary between conj and list lies (the guidelines say list should not be overused; I think asyndetic coordination should still be annotated as coordination with conj), but this is irrelevant to the question of punctuation attachment.

In UD_Russian, I see sentence dev-s113 with (simplified) семья (папа, мама, брат), where мама and брат are attached as list to семья, but I think this is an error (which would probably remain unnoticed without the punct-nonproj test): they should be attached to папа (which is correctly attached to семья as appos).

olesar commented 7 years ago

Yes, generally the script helps us reveal lots of errors. Thanks once more. OK, we will follow the "right bracket to the last element" rule. Olga

jnivre commented 7 years ago

Thanks, Martin. Very useful! We should really make a systematic effort to make use of all this information (once we have recovered from workshops and shared tasks).

Joakim

perrier54 commented 5 years ago

We can annotate the example X (Y, Z, W ...) while keeping projectivity and attaching the brackets to the head of the expression they delimit. For this purpose, we consider that (Y, Z, W ...) is an enumeration that is part of another enumeration consisting of two elements: X and (Y, Z, W ...). In this view, the brackets depend on Y, and only Y depends on X. Such an annotation is consistent with the interpretation of the brackets: Y, Z and W are not at the same level as X in the enumeration.

There is a more problematic case. Sometimes quotation marks can cause non-projectivity, because a word outside the quotation marks can depend on a word that is part of the quoted expression without being its head.

Example: UD_French-GSD, fr-ud-train05807 les rebelles avaient été « informés par les autorités françaises de la présence de ces hommes » à Benghazi (the rebels had been "informed by the French authorities of the presence of these men" in Benghazi)

"Benghazi" depends on "presence", which is not the head of the expression in quotation marks. The head is "informés", and it makes sense to attach the quotation marks to "informés", which causes non-projectivity. Of course, it is possible to tinker with the annotation: attach the first mark to "informés" and the second mark to "presence" or "men", for example. I would prefer to add an exception to the guidelines, only for quotation marks, allowing non-projectivity for them.