UniversalDependencies / tools

Various utilities for processing the data.
GNU General Public License v2.0

Incorrect flagging of non-projective punctuation in validator #52

Closed sebschu closed 5 years ago

sebschu commented 5 years ago

For this sentence, the validator claims that the punctuation mark node 14 is introducing non-projectivity. However, non-projective relations (introduced by the reparandum relation) would also exist without the punctuation mark, and I think the structure of this tree adheres to the punctuation guidelines.

# sent_id = n01002058
# text = What she’s saying and what she’s doing, it — actually, it’s unbelievable.
1       What    what    PRON    WP      PronType=Int    4       obj     4:obj   _
2       she     she     PRON    PRP     Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs   4       nsubj   4:nsubj SpaceAfter=No
3       ’s      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   4       aux     4:aux   _
4       saying  say     VERB    VBG     VerbForm=Ger    17      dislocated      17:dislocated   _
5       and     and     CCONJ   CC      _       9       cc      9:cc    _
6       what    what    PRON    WP      PronType=Int    9       obj     9:obj   _
7       she     she     PRON    PRP     Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs   9       nsubj   9:nsubj SpaceAfter=No
8       ’s      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   9       aux     9:aux   _
9       doing   do      VERB    VBG     VerbForm=Ger    4       conj    4:conj:and      SpaceAfter=No
10      ,       ,       PUNCT   ,       _       17      punct   17:punct        _
11      it      it      PRON    PRP     Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs  15      reparandum      15:reparandum   _
12      —       —       PUNCT   :       _       11      punct   11:punct        _
13      actually        actually        ADV     RB      _       17      advmod  17:advmod       SpaceAfter=No
14      ,       ,       PUNCT   ,       _       17      punct   17:punct        _
15      it      it      PRON    PRP     Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs  17      nsubj   17:nsubj        SpaceAfter=No
16      ’s      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   17      cop     17:cop  _
17      unbelievable    unbelievable    ADJ     JJ      Degree=Pos      0       root    0:root  SpaceAfter=No
18      .       .       PUNCT   .       _       17      punct   17:punct        _

martinpopel commented 5 years ago

See https://github.com/udapi/udapi-python/issues/52#issuecomment-486943687

BTW: ud.FixPunct fixes your sentence automatically.

sebschu commented 5 years ago

But from that discussion:

Yes. Attachment of punctuation should not cause non-projectivity. It can cause non-projectivity either because the punctuation node is attached non-projectively, or because it creates a gap (containing the punctuation) due to which other dependencies become non-projective. But if there is another node in the gap and the punctuation is attached to that node, then the punctuation is not taken as the cause of the non-projectivity.

AFAICS, in this example, the punctuation is not creating a gap, and I don't think it would be correct to attach the comma in 14 to "actually" in 13. Or am I missing something here?
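
For reference, the gaps in this sentence can be computed directly from the HEAD column. The following is only a minimal sketch, not the validator's code: the helper names (`is_descendant`, `gap`, `nonprojective_edges`) are made up for illustration, and a gap is assumed to mean the nodes lying strictly between a dependent and its head that the head does not dominate.

```python
# HEAD column of sentence n01002058, copied from the CoNLL-U above
# (key = node ID, value = ID of its head, 0 = root).
HEADS = {1: 4, 2: 4, 3: 4, 4: 17, 5: 9, 6: 9, 7: 9, 8: 9, 9: 4, 10: 17,
         11: 15, 12: 11, 13: 17, 14: 17, 15: 17, 16: 17, 17: 0, 18: 17}


def is_descendant(node, ancestor, heads):
    """True if `ancestor` is reached from `node` by following HEAD links."""
    while node != 0:
        node = heads[node]
        if node == ancestor:
            return True
    return False


def gap(dep, heads):
    """Nodes strictly between `dep` and its head that the head does not dominate."""
    head = heads[dep]
    if head == 0:  # the root attachment spans nothing
        return []
    lo, hi = sorted((dep, head))
    return [n for n in range(lo + 1, hi)
            if n in heads and not is_descendant(n, head, heads)]


def nonprojective_edges(heads):
    """Map each non-projectively attached dependent to its gap."""
    result = {}
    for dep in heads:
        g = gap(dep, heads)
        if g:
            result[dep] = g
    return result


print(nonprojective_edges(HEADS))          # {11: [13, 14]}

# Hypothetical variant: the same tree with the comma (node 14) deleted.
without_comma = {n: h for n, h in HEADS.items() if n != 14}
print(nonprojective_edges(without_comma))  # {11: [13]}
```

Under these assumptions the only non-projective dependency is reparandum(15, 11). Its gap contains both "actually" (13) and the comma (14), and it stays non-projective when the comma is removed.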

dan-zeman commented 5 years ago

I don't think attaching 14 to 13 would be wrong; in fact, it would be my preferred choice if I were annotating this sentence manually. My reason for that would be that the comma is delimiting "actually" from its head, hence in some sense it belongs to "actually" (it is there because of "actually").

It was not necessarily my intention to prevent you from attaching 14 to 17. But now that I have checked my code, I see that I actually assume that the tested punctuation node is attached to one of the nodes in the same gap. That is, the non-projectivity is not reported if the parent of the current node lies in the same gap. punct(13, 14) would thus be valid.

So the question is, are we OK with this slightly stricter constraint, or should I modify the code so that just the mere presence of another non-punctuation node in the same gap makes the punctuation valid?
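
The two options can be compared on this sentence with a small sketch. This is an illustration of the rule as described in this comment, not the actual validator code: `dominated_by`, `gaps`, `flagged_punct` and the `strict` parameter are hypothetical names, and only the "punctuation sits in a gap" case is covered (not punctuation that is itself attached non-projectively).

```python
# HEAD per node ID and the set of PUNCT nodes, copied from the sentence
# in this thread.
HEADS = {1: 4, 2: 4, 3: 4, 4: 17, 5: 9, 6: 9, 7: 9, 8: 9, 9: 4, 10: 17,
         11: 15, 12: 11, 13: 17, 14: 17, 15: 17, 16: 17, 17: 0, 18: 17}
PUNCT = {10, 12, 14, 18}


def dominated_by(node, ancestor, heads):
    """True if `ancestor` is reached from `node` by following HEAD links."""
    while node != 0:
        node = heads[node]
        if node == ancestor:
            return True
    return False


def gaps(heads):
    """Yield the gap (set of node IDs) of every non-projective dependency."""
    for dep, head in heads.items():
        if head == 0:
            continue
        lo, hi = sorted((dep, head))
        g = {n for n in range(lo + 1, hi) if not dominated_by(n, head, heads)}
        if g:
            yield g


def flagged_punct(heads, punct, strict=True):
    """Punctuation nodes blamed for a gap.

    strict=True:  excused only if the punctuation's parent is in the same gap.
    strict=False: excused if any non-punctuation node shares the gap.
    """
    flagged = set()
    for g in gaps(heads):
        for p in g & punct:
            excused = heads[p] in g if strict else bool(g - punct)
            if not excused:
                flagged.add(p)
    return sorted(flagged)


print(flagged_punct(HEADS, PUNCT))                # [14]
print(flagged_punct(HEADS, PUNCT, strict=False))  # []

# Reattach the comma to "actually": punct(13, 14).
reattached = {**HEADS, 14: 13}
print(flagged_punct(reattached, PUNCT))           # []
```

With the stricter rule the original tree is flagged at node 14 and punct(13, 14) passes. With the looser rule the original tree would pass as well, because "actually" (13) is in the same gap.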

martinpopel commented 5 years ago

The comma at position 14 is causing a non-projective gap (by definition). If you attach the comma to "actually", it is only "actually" which is causing the non-projective gap. This is exactly what Dan wrote in the quoted discussion:

But if there is another node in the gap and the punctuation is attached to that node, then the punctuation is not taken as the cause of the non-projectivity.

So if the punctuation is just in the gap as a sibling with another node, this is still considered a bug by the validator. If you attach the punctuation to another non-punctuation node in the gap, you'll make the validator happy. In other words (or rather pictures):

─┮
 │   ╭─╼ What
 │   ┢─╼ she
 │   ┢─╼ ’s
 │ ╭─┾ saying
 │ │ │ ╭─╼ and
 │ │ │ ┢─╼ what
 │ │ │ ┢─╼ she
 │ │ │ ┢─╼ ’s
 │ │ ╰─┶ doing
 │ ┢─╼ ,
 │ │            ╭─┮ it
 │ │            │ ╰─╼ —
 │ ┢─╼ actually │
 │ ┢─╼ ,        │ <-- punct causing a non-projective gap = BUG
 │ ┢────────────┶ it
 │ ┢─╼ ’s
 ╰─┾ unbelievable
   ╰─╼ .

─┮
 │   ╭─╼ What
 │   ┢─╼ she
 │   ┢─╼ ’s
 │ ╭─┾ saying
 │ │ │ ╭─╼ and
 │ │ │ ┢─╼ what
 │ │ │ ┢─╼ she
 │ │ │ ┢─╼ ’s
 │ │ ┡─┶ doing
 │ │ ╰─╼ ,
 │ │            ╭─┮ it
 │ │            │ ╰─╼ —
 │ ┢─┮ actually │ <-- non-punct causing a non-proj gap = OK
 │ │ ╰─╼ ,      │ <-- punct in a gap, but not causing it = OK
 │ ┢────────────┶ it
 │ ┢─╼ ’s
 ╰─┾ unbelievable
   ╰─╼ .

dan-zeman commented 5 years ago

EDIT: I did not read my own comment quoted by @sebschu carefully enough :-)

But if there is another node in the gap and the punctuation is attached to that node, then the punctuation is not taken as the cause

So I actually said it right.

sebschu commented 5 years ago

Ok, yeah, this makes sense. Thanks for clarifying!