malformed parse for sent_id = n01012003

bansp commented 4 years ago

Hi, and please excuse me if this is posted in the wrong place. (I did read the contributing guidelines but found no suggestion of a better target.)

I have stumbled on a parse error in the following sentence in the current form of the UD_English-PUD treebank:

# newdoc id = n01012
# sent_id = n01012003
# text = First one of the Yazidi women started crying, then one of her friends.
1   First   first   ADV RB  _   7   advmod  7:advmod    _
2   one one NUM CD  NumType=Card    7   nsubj   7:nsubj|8:nsubj:xsubj|11:nsubj  _
3   of  of  ADP IN  _   6   case    6:case  _
4   the the DET DT  Definite=Def|PronType=Art   6   det 6:det   _
5   Yazidi  Yazidi  PROPN   NNP Number=Sing 6   compound    6:compound  _
6   women   woman   NOUN    NNS Number=Plur 2   nmod    2:nmod:of   _
7   started start   VERB    VBD Mood=Ind|Tense=Past|VerbForm=Fin    0   root    0:root  _
7.1 started start   VERB    VBD Mood=Ind|Tense=Past|VerbForm=Fin    _   _   7:conj  _
8   crying  cry VERB    VBG VerbForm=Ger    7   xcomp   7:xcomp SpaceAfter=No
9   ,   ,   PUNCT   ,   _   11  punct   11:punct    _
10  then    then    ADV RB  PronType=Dem    11  orphan  7.1:nsubj   _
11  one one NUM CD  NumType=Card    7   conj    7:conj|7.1:xcomp    _
12  of  of  ADP IN  _   14  case    14:case _
13  her she PRON    PRP$    Gender=Fem|Number=Sing|Person=3|Poss=Yes|PronType=Prs   14  nmod:poss   14:nmod:poss    _
14  friends friend  NOUN    NNS Number=Plur 11  nmod    11:nmod:of  SpaceAfter=No
15  .   .   PUNCT   .   _   7   punct   7:punct _

The 7 vs. 7.1 seems to be the offending bit.

Since I am totally lost when it comes to unravelling the conjunction magic of UD and dependencies in general, I am unable to suggest a correction but hope that someone else will be able to fix that :-) Thanks in advance!

bansp commented 4 years ago

The reason I raised this issue was that I got an explicit error when importing this dataset into INCEpTION, and saw weird parses in the viewers provided by UDPipe and Tündra (while the visualization by conllu-viewer made me think that perhaps I wasn't getting the entire picture).
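For anyone hitting a similar import error: the CoNLL-U format allows three shapes of ID, namely plain integers for words, ranges such as 3-4 for multiword tokens, and decimals such as 7.1 for empty nodes. An empty node leaves HEAD and DEPREL as `_` and carries its attachment only in the DEPS column, so a reader that expects integer IDs only may reject the line. The following is a minimal sketch of that distinction, not the logic of any of the tools named above; the names `classify_id` and `check_empty_node` are hypothetical, and the row is the 7.1 line from the sentence in question, re-joined with tabs.

```python
import re

# ID shapes allowed by the CoNLL-U format.
WORD_ID = re.compile(r"^[1-9][0-9]*$")                # ordinary word, e.g. "7"
RANGE_ID = re.compile(r"^[1-9][0-9]*-[1-9][0-9]*$")   # multiword token, e.g. "3-4"
EMPTY_ID = re.compile(r"^[0-9]+\.[1-9][0-9]*$")       # empty node, e.g. "7.1"


def classify_id(token_id):
    """Return 'word', 'range', 'empty', or 'invalid' for a CoNLL-U ID (hypothetical helper)."""
    if WORD_ID.match(token_id):
        return "word"
    if RANGE_ID.match(token_id):
        return "range"
    if EMPTY_ID.match(token_id):
        return "empty"
    return "invalid"


def check_empty_node(columns):
    """An empty node leaves HEAD and DEPREL as '_' and fills DEPS (hypothetical helper)."""
    head, deprel, deps = columns[6], columns[7], columns[8]
    return head == "_" and deprel == "_" and deps != "_"


# The 7.1 row from sent_id = n01012003, with columns joined by tabs.
row = "\t".join([
    "7.1", "started", "start", "VERB", "VBD",
    "Mood=Ind|Tense=Past|VerbForm=Fin", "_", "_", "7:conj", "_",
]).split("\t")

print(classify_id(row[0]), check_empty_node(row))  # empty True
```

By this check the 7.1 line is a well-formed empty node, which suggests the import error comes from a tool that does not handle empty nodes rather than from a malformed ID.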

I have now skimmed through (Schuster & Manning, 2016) and some docs, and understand that this is an enhanced representation of elision under conjunction. The new question that this raises for me comes from the absence of similar analyses for the parallel languages that I have had a look at, namely German, French, Italian, and Polish.

In Polish, in particular, the corresponding sentence is

# sent_id = s25
# text = Najpierw zaczęła płakać jedna z jezydek, potem jej przyjaciółka.
# orig_file_sentence = n01012003#25
# conversion_status = complete
1   Najpierw    najpierw    ADV adv _   2   advmod  2:advmod    _
2   zaczęła zacząć  VERB    praet:sg:f:perf Aspect=Perf|Gender=Fem|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act   0   root    0:root  _
3   płakać  płakać  VERB    inf:imperf  Aspect=Imp|VerbForm=Inf|Voice=Act   2   xcomp   2:xcomp _
4   jedna   jeden   ADJ adj:sg:nom:f:pos    Case=Nom|Degree=Pos|Gender=Fem|Number=Sing  2   nsubj   2:nsubj _
5   z   z   ADP prep:gen:nwok   AdpType=Prep|Variant=Short  6   case    6:case  Case=Gen
6   jezydek jezydka NOUN    subst:pl:gen:f  Case=Gen|Gender=Fem|Number=Plur 4   obl 4:obl   SpaceAfter=No
7   ,   ,   PUNCT   interp  PunctType=Comm  10  punct   10:punct    _
8   potem   potem   ADV adv _   10  advmod  10:advmod   _
9   jej on  PRON    ppron3:sg:gen:f:ter:akc:npraep  Case=Gen|Gender=Fem|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long 10  nmod    10:nmod _
10  przyjaciółka    przyjaciółka    NOUN    subst:sg:nom:f  Case=Nom|Gender=Fem|Number=Sing 2   conj    0:root|2:conj   SpaceAfter=No
11  .   .   PUNCT   interp  PunctType=Peri  2   punct   2:punct _

where zaczęła (płakać) could be analysed as elided as well, but isn't (and I'm ignoring the appearance of the second root here, which signals to me that I know even less about this kind of dependency than I thought I did).

My question/worry is: at least when set against the analyses in de, fr, it, and pl, doesn't the en representation using the dot-based elision mechanism wrongly suggest that there is something language-particular about this kind of ellipsis? Or, more poetically: how parallel can one realistically expect the PUD datasets to be, given (as far as I understand) different analyses for phenomena that should in theory receive parallel treatment?

Thanks in advance and best wishes :-)

jnivre commented 4 years ago

The introduction of null nodes in the enhanced dependencies is triggered by the relation "orphan" in the basic dependencies. The "orphan" relation in turn should be used when another relation would be misleading because a word is attached to a promoted head which is really a co-dependent (see https://universaldependencies.org/u/overview/specific-syntax.html#ellipsis).

The difference between the English and Polish annotation is that the English annotators have judged the attachment of the adverb "then" to "one" with the relation "advmod" misleading and have therefore used the "orphan" relation instead (which leads to the introduction of the null node in enhanced dependencies). By contrast, the Polish annotators have kept the "advmod" relation, and therefore there is no null node.

For what it is worth, I checked the Swedish PUD treebank (which we annotated in Uppsala) and it follows the English analysis (with "orphan" and a null node). Possibly the same should have been done in Polish. In all fairness, however, it should be pointed out that the guidelines are not precise enough here. It is quite clear that the "orphan" relation should be used when a core argument is attached to another core argument (as in "I like coffee and you tea"), but it is not clear whether this extends to adverbial modifiers. Hence, the guidelines need to be specified better.
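The link described above between "orphan" in the basic dependencies and null nodes in the enhanced dependencies can be checked mechanically. Below is a minimal sketch, assuming rows copied from the two sentences in this thread; the function name `orphan_report` is hypothetical, and splitting on whitespace is a simplification that only works because none of these fields contain spaces (real CoNLL-U files are tab-separated).

```python
def orphan_report(lines):
    """Return (IDs of words whose basic DEPREL is 'orphan', IDs of empty nodes)."""
    orphans, empty_nodes = [], []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split()  # simplification: real CoNLL-U is tab-separated
        if "." in cols[0]:
            empty_nodes.append(cols[0])   # decimal ID marks an empty (null) node
        elif cols[7] == "orphan":
            orphans.append(cols[0])       # column 8 (index 7) is the basic DEPREL
    return orphans, empty_nodes


# Selected rows from the English sentence (sent_id = n01012003), as posted.
english = [
    "7 started start VERB VBD Mood=Ind|Tense=Past|VerbForm=Fin 0 root 0:root _",
    "7.1 started start VERB VBD Mood=Ind|Tense=Past|VerbForm=Fin _ _ 7:conj _",
    "10 then then ADV RB PronType=Dem 11 orphan 7.1:nsubj _",
    "11 one one NUM CD NumType=Card 7 conj 7:conj|7.1:xcomp _",
]

# Selected rows from the Polish sentence (sent_id = s25), as posted.
polish = [
    "8 potem potem ADV adv _ 10 advmod 10:advmod _",
    "10 przyjaciółka przyjaciółka NOUN subst:sg:nom:f "
    "Case=Nom|Gender=Fem|Number=Sing 2 conj 0:root|2:conj SpaceAfter=No",
]

print(orphan_report(english))  # (['10'], ['7.1'])
print(orphan_report(polish))   # ([], [])
```

Run over whole parallel treebanks, a report like this would list the sentences where one language uses "orphan" with a null node and another does not, which is the kind of divergence raised in this issue.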

bansp commented 4 years ago

Thanks a lot for the reply, Joakim. Diving deeper into the UD literature now... :-) And closing this issue. Best wishes, Piotr

UniversalDependencies / UD_English-PUD
