UppsalaNLP / uuparser

A transition-based parser for Universal Dependencies with BiLSTM word and character representations.
Apache License 2.0

rroot relation in model predictions #30

Open LuDuerlich opened 3 years ago

LuDuerlich commented 3 years ago

I have been training parsers for multiple languages and observed a small number of instances where the parser predicts rroot instead of root on the dev set.

At first I thought this could be due to typos in the training data, but I could not find any instances in any of the UD treebanks (version 2.8). Instead, I found that rroot is introduced as part of a dummy root node in read_conll in utils.py. I suppose this is not really a typo in the code, but a dummy value that is meant to be overwritten by the parser, and in most cases it is.

The options I set were --dynet-mem 6000 --epochs 50 --k=2 --pos-emb-size 0 --char-emb-size 100 --disable-rlmost

and I observed it in some of the dev predictions starting at epoch 22 for Basque-BDT (random seed of 2) and in some of the predictions starting at the first epoch for Hindi-HDTB (random seed of 5).

mdelhoneux commented 3 years ago

Mmmh, this is strange. rroot is indeed used as a dummy dependency relation for the dummy root token; it should never be used for any other token and should never be printed. This is quite hard to debug if it's that infrequent :/ It probably won't help, but can you show me a sample of CoNLL-U output where this happens?

LuDuerlich commented 3 years ago

Here is some output for Basque:

# sent_id = dev-s1144
# text = Kroaziarraren kasuan, normaltzat jo behar da hori, orain artean oso gutxi jokatu baitu.
1       Kroaziarraren   kroaziar        NOUN    _       Case=Gen|Definite=Def|Number=Sing       2       nmod    _       _
2       kasuan  kasu    NOUN    _       Animacy=Inan|Case=Ine|Definite=Def|Number=Sing  0       obl     _       SpaceAfter=No
3       ,       ,       PUNCT   _       _       2       punct   _       _
4       normaltzat      normal  ADJ     _       Case=Ess|Definite=Ind   5       obl     _       _
5       jo      jo      VERB    _       VerbForm=Part   3       xcomp   _       _
6       behar   behar   NOUN    _       Case=Abs|Definite=Ind   7       compound        _       _
7       da      izan    VERB    _       Aspect=Prog|Mood=Ind|Number[abs]=Sing|Person[abs]=3     14      rroot   _       _
8       hori    hori    DET     _       Case=Abs|Definite=Def|Number=Sing       14      nsubj   _       SpaceAfter=No
9       ,       ,       PUNCT   _       _       7       punct   _       _
10      orain   orain   ADV     _       Case=Ine        14      advmod  _       _
11      artean  arte    ADP     _       Case=Ine        10      case    _       _
12      oso     oso     ADV     _       _       13      advmod  _       _
13      gutxi   gutxi   ADV     _       _       14      obl     _       _
14      jokatu  jokatu  VERB    _       Aspect=Perf|VerbForm=Part       5       advcl   _       _
15      baitu   *edun   AUX     _       Mood=Ind|Number[abs]=Sing|Number[erg]=Sing|Person[abs]=3|Person[erg]=3  14      aux     _       SpaceAfter=No
16      .       .       PUNCT   _       _       7       punct   _       _

# sent_id = dev-s1366
# text = "Araudia ikusita, jendea orain baino lehenago irten beharko da etxetik anbientea sortzeko...".
1       "       "       PUNCT   _       _       0       punct   _       SpaceAfter=No
2       Araudia araudi  NOUN    _       Animacy=Inan|Case=Abs|Definite=Def|Number=Sing  3       obj     _       _
3       ikusita ikusi   VERB    _       VerbForm=Part   1       advcl   _       SpaceAfter=No
4       ,       ,       PUNCT   _       _       3       punct   _       _
5       jendea  jende   NOUN    _       Case=Abs|Definite=Def|Number=Sing       9       nsubj   _       _
6       orain   orain   ADV     _       _       7       advmod  _       _
7       baino   baino   X       _       _       9       advmod  _       _
8       lehenago        lehenago        ADV     _       _       9       advmod  _       _
9       irten   irten   VERB    _       VerbForm=Part   4       xcomp   _       _
10      beharko behar_izan      VERB    _       _       9       rroot   _       _
11      da      izan    AUX     _       Mood=Ind|Number[abs]=Sing|Person[abs]=3 10      aux     _       _
12      etxetik etxe    NOUN    _       Animacy=Inan|Case=Abl|Definite=Def|Number=Sing  14      obl     _       _
13      anbientea       anbiente        NOUN    _       Case=Abs|Definite=Def|Number=Sing       14      obj     _       _
14      sortzeko        sortu   VERB    _       Case=Abs|Definite=Ind   10      advcl   _       SpaceAfter=No
15      ...     ...     PUNCT   _       _       10      punct   _       SpaceAfter=No
16      "       "       PUNCT   _       _       10      punct   _       SpaceAfter=No
17      .       .       PUNCT   _       _       10      punct   _       _

From what I could tell there are only about 4 sentences in the Basque dev set across all training epochs where rroot has been predicted, but per epoch, it gets predicted at most twice, so there is some variation.

And Hindi:

# sent_id = dev-s139
# text = लोकसभा में पेश की गई अपनी रिपोर्ट में कमेटी का कहना है कि रेलवे को केंद्रीय मदद अब ५० फीसदी से भी अधिक मिलने लगी है ।
1       लोकसभा  लोकसभा  NOUN    NN      Case=Acc|Gender=Fem|Number=Sing|Person=3        4       obl     _       Vib=0_में|Tam=0|ChunkId=NP|ChunkType=head|Translit=lokasabhā
2       में       में       ADP     PSP     AdpType=Post    1       case    _       ChunkId=NP|ChunkType=child|Translit=meṁ
3       पेश      पेश      ADJ     JJ      _       4       compound        _       ChunkId=JJP|ChunkType=head|Translit=peśa
4       की      कर      VERB    VM      Aspect=Perf|Gender=Fem|Number=Sing|VerbForm=Part        7       acl     _       Vib=या_जा+या१|Tam=yA|ChunkId=VGNF|ChunkType=head|Translit=kī
5       गई      जा      AUX     VAUX    Aspect=Perf|Gender=Fem|Number=Sing|VerbForm=Part        4       aux:pass        _       Vib=या१|Tam=yA1|ChunkId=VGNF|ChunkType=child|Translit=gaī
6       अपनी    अपना    PRON    PRP     Case=Acc|Gender=Fem|PronType=Prs        7       nmod    _       Vib=0|Tam=0|ChunkId=NP2|ChunkType=head|Translit=apanī
7       रिपोर्ट  रिपोर्ट  NOUN    NN      Case=Acc|Gender=Fem|Number=Sing|Person=3        0       obl     _       Vib=0_में|Tam=0|ChunkId=NP3|ChunkType=head|Translit=riporṭa
8       में       में       ADP     PSP     AdpType=Post    7       case    _       ChunkId=NP3|ChunkType=child|Translit=meṁ
9       कमेटी    कमेटी    NOUN    NN      Case=Acc|Gender=Fem|Number=Sing|Person=3        11      nsubj   _       Vib=0_का|Tam=0|ChunkId=NP4|ChunkType=head|Translit=kameṭī
10      का      का      ADP     PSP     AdpType=Post|Case=Nom|Gender=Masc|Number=Sing   9       case    _       ChunkId=NP4|ChunkType=child|Translit=kā
11      कहना    कह      VERB    VM      Case=Nom|VerbForm=Inf   7       amod    _       Vib=ना|Tam=nA|ChunkId=VGNN|ChunkType=head|Translit=kahanā
12      है       है       VERB    VM      Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act 24      rroot   _       Vib=है|Tam=hE|ChunkId=VGF|ChunkType=head|Stype=declarative|Translit=hai
13      कि      कि      SCONJ   CC      _       24      mark    _       AltTag=SCONJ-CONJ|ChunkId=CCP|ChunkType=head|Translit=ki
14      रेलवे     रेलवे     NOUN    NN      Case=Acc|Gender=Masc|Number=Sing|Person=3       24      nsubj   _       Vib=0_को|Tam=0|ChunkId=NP5|ChunkType=head|Translit=relave
15      को      को      ADP     PSP     AdpType=Post    14      case    _       ChunkId=NP5|ChunkType=child|Translit=ko
16      केंद्रीय   केंद्रीय   ADJ     JJ      Case=Nom        17      compound        _       ChunkId=NP6|ChunkType=child|Translit=keṁdrīya
17      मदद     मदद     NOUN    NN      Case=Nom|Gender=Fem|Number=Sing|Person=3        24      nsubj   _       Vib=0|Tam=0|ChunkId=NP6|ChunkType=head|Translit=madada
18      अब      अब      PRON    PRP     Case=Nom|PronType=Prs   24      obl     _       ChunkId=NP7|ChunkType=head|Translit=aba
19      ५०      ५०      NUM     QC      NumType=Card    20      nummod  _       ChunkId=NP8|ChunkType=child|Translit=50
20      फीसदी   फीसदी   NOUN    NN      Case=Acc|Gender=Fem|Number=Sing|Person=3        24      obl     _       Vib=0_से|Tam=0|ChunkId=NP8|ChunkType=head|Translit=phīsadī
21      से       से       ADP     PSP     AdpType=Post    20      case    _       ChunkId=NP8|ChunkType=child|Translit=se
22      भी      भी      PART    RP      _       20      dep     _       ChunkId=NP8|ChunkType=child|Translit=bhī
23      अधिक    अधिक    DET     QF      PronType=Ind    24      nsubj   _       AltTag=ADJ-DET|ChunkId=JJP2|ChunkType=head|Translit=adhika
24      मिलने    मिल     VERB    VM      Gender=Fem|Number=Sing|Person=3|VerbForm=Inf|Voice=Act  11      obj     _       Vib=ना_लग+या_है|Tam=nA|ChunkId=VGF2|ChunkType=head|Stype=declarative|Translit=milane
25      लगी     लग      AUX     VAUX    Aspect=Perf|Gender=Fem|Number=Sing|VerbForm=Part        24      aux     _       Vib=या|Tam=yA|ChunkId=VGF2|ChunkType=child|Translit=lagī
26      है       है       AUX     VAUX    Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   24      aux:pass        _       Vib=है|Tam=hE|ChunkId=VGF2|ChunkType=child|Translit=hai
27      ।       ।       PUNCT   SYM     _       12      punct   _       ChunkId=BLK|ChunkType=head|Translit=.

# sent_id = dev-s177
# text = लेकिन हम लोगों का मानना है कि राष्ट्रपति, प्रधानमंत्री और मुख्य न्यायाधीश को कम से कम इससे बाहर होना चाहिए ।
1       लेकिन    लेकिन    CCONJ   CC      _       0       cc      _       ChunkId=CCP|ChunkType=head|Translit=lekina
2       हम      हम      DET     DEM     Case=Nom|Number=Plur|Person=1|PronType=Dem      3       det     _       ChunkId=NP|ChunkType=child|Translit=hama
3       लोगों    लोग     NOUN    NN      Case=Acc|Gender=Masc|Number=Plur|Person=3       5       nsubj   _       Vib=0_का|Tam=0|ChunkId=NP|ChunkType=head|Translit=logoṁ
4       का      का      ADP     PSP     AdpType=Post|Case=Nom|Gender=Masc|Number=Sing   3       case    _       ChunkId=NP|ChunkType=child|Translit=kā
5       मानना   मान     VERB    VM      Case=Nom|VerbForm=Inf   1       mark    _       Vib=ना|Tam=nA|ChunkId=VGNN|ChunkType=head|Translit=mānanā
6       है       है       VERB    VM      Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act 20      rroot   _       Vib=है|Tam=hE|ChunkId=VGF|ChunkType=head|Stype=declarative|Translit=hai
7       कि      कि      SCONJ   CC      _       20      mark    _       AltTag=SCONJ-CONJ|ChunkId=CCP2|ChunkType=head|Translit=ki
8       राष्ट्रपति        राष्ट्रपति        PROPN   NNP     Case=Acc|Gender=Masc|Number=Sing|Person=3       20      nsubj   _       SpaceAfter=No|Vib=0|Tam=0|ChunkId=NP2|ChunkType=head|Translit=rāṣṭrapati
9       ,       ,       PUNCT   SYM     _       10      punct   _       ChunkId=NP2|ChunkType=child|Translit=,
10      प्रधानमंत्री       प्रधानमंत्री       PROPN   NNP     Case=Acc|Gender=Masc|Number=Sing|Person=3       8       conj    _       Vib=0|Tam=0|ChunkId=NP3|ChunkType=head|Translit=pradhānamaṁtrī
11      और      और      CCONJ   CC      _       13      cc      _       ChunkId=CCP3|ChunkType=head|Translit=aura
12      मुख्य     मुख्य     NOUN    NNC     Case=Nom|Gender=Masc|Number=Sing|Person=3       13      amod    _       Vib=0|Tam=0|ChunkId=NP4|ChunkType=child|Translit=mukhya
13      न्यायाधीश        न्यायाधीश        NOUN    NN      Case=Acc|Gender=Masc|Number=Sing|Person=3       8       conj    _       Vib=0_को|Tam=0|ChunkId=NP4|ChunkType=head|Translit=nyāyādhīśa
14      को      को      ADP     PSP     AdpType=Post    13      case    _       ChunkId=NP4|ChunkType=child|Translit=ko
15      कम      कम      DET     QF      PronType=Ind    18      det     _       ChunkId=NP5|ChunkType=child|Translit=kama
16      से       से       PART    RP      _       15      dep     _       ChunkId=NP5|ChunkType=child|Translit=se
17      कम      कम      DET     QF      PronType=Ind    18      det     _       AltTag=ADJ-DET|ChunkId=NP5|ChunkType=head|Translit=kama
18      इससे     यह      PRON    PRP     Case=Acc,Ins|Number=Sing|Person=3|PronType=Prs  20      obl     _       Vib=से|Tam=se|ChunkId=NP6|ChunkType=head|Translit=isase
19      बाहर    बाहर    ADV     NST     AdpType=Post|Case=Nom|Gender=Masc|Number=Sing|Person=3  18      case    _       AltTag=ADV-NOUN|ChunkId=NP7|ChunkType=head|Translit=bāhara
20      होना    हो      VERB    VM      Gender=Masc|VerbForm=Inf|Voice=Act      5       obj     _       Vib=ना_चाहिए|Tam=nA|ChunkId=VGF2|ChunkType=head|Stype=declarative|Translit=honā
21      चाहिए   चाहिए   AUX     VAUX    _       20      aux     _       Vib=0|Tam=0|ChunkId=VGF2|ChunkType=child|Translit=cāhie
22      ।       ।       PUNCT   SYM     _       6       punct   _       ChunkId=BLK|ChunkType=head|Translit=.

Here, there appear to be more instances. In some epochs, rroot gets predicted as many as 17 times.
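For anyone who wants to reproduce these counts, here is a minimal sketch of how such tokens could be located in a prediction file. It assumes standard tab-separated CoNLL-U with 10 columns and DEPREL in the eighth column; `find_rroot` is a hypothetical helper written for this illustration and is not part of uuparser.

```python
def find_rroot(conllu_lines, label="rroot"):
    """Return (sent_id, token_id, form) for every token whose DEPREL equals `label`.

    `conllu_lines` is any iterable of lines, e.g. an open file handle.
    """
    hits = []
    sent_id = None
    for line in conllu_lines:
        line = line.rstrip("\n")
        if line.startswith("# sent_id"):
            # Comment lines look like "# sent_id = dev-s1144"
            sent_id = line.split("=", 1)[1].strip()
            continue
        if not line or line.startswith("#"):
            continue  # blank separator or other comment
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line; skip it
        if cols[7] == label:  # DEPREL is the 8th CoNLL-U column
            hits.append((sent_id, cols[0], cols[1]))
    return hits
```

It can be run over one prediction file per epoch (the file name below is only a placeholder):

```python
with open("dev_predictions.conllu", encoding="utf-8") as f:
    for sent_id, tok_id, form in find_rroot(f):
        print(sent_id, tok_id, form)
```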

mdelhoneux commented 3 years ago

Thanks! These two sentences are non-projective. My suspicion is that it might be due to max_swap in Predict, in uuparser/arc_hybrid.py, which should not actually be necessary: I used it in the early debugging days but never went back to change it. Could you try setting max_swap to inf or len(sentence)*len(sentence) in this line? https://github.com/UppsalaNLP/uuparser/blob/c0d8a8210c1032272dfad9250a765f09e128976f/uuparser/arc_hybrid.py#L287
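To check whether a given sentence is non-projective, one option is to look for crossing arcs in its head assignments. The following is a rough sketch, assuming the heads are given as a list where `heads[i]` is the head of token `i + 1` and `0` stands for the root; `is_projective` is a hypothetical helper, not a function from uuparser.

```python
def is_projective(heads):
    """Return True if no two dependency arcs cross.

    heads[i] is the head id of token i + 1 (token ids are 1-based, 0 = root).
    """
    # Represent each arc by its left and right endpoint, ignoring direction.
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for l1, r1 in arcs:
        for l2, r2 in arcs:
            # Two arcs cross when exactly one endpoint of the second
            # lies strictly inside the span of the first.
            if l1 < l2 < r1 < r2:
                return False
    return True
```

The quadratic loop is fine for single sentences; it is meant for inspection, not for use inside training.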

LuDuerlich commented 3 years ago

I tried both versions:

mdelhoneux commented 3 years ago

Ok, thanks! I still think it must have something to do with non-projectivity and the use of swap, but I have no idea what specifically at this point. I will take a look, but it probably won't be this week, sorry :/ Theoretically, there should be no difference between len(sentence)**2 and inf, because any pair of words can only be swapped once. So it probably has something to do with the conditions for swap in lines 174 to 182. There might be an edge case we did not cover?
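The bound can be illustrated with a toy example. If every pair of words is swapped at most once, the number of swaps for a sentence of n words is at most n(n-1)/2, which is always below n**2. The sketch below is not the parser's swap transition; it just reorders a list with adjacent swaps, where each out-of-order pair is swapped exactly once, and compares the count to that bound. Both function names are made up for this illustration.

```python
def count_adjacent_swaps(order):
    """Sort `order` with adjacent swaps and return how many swaps were needed.

    Only pairs that are out of order get swapped, so each pair of
    elements is swapped at most once.
    """
    seq = list(order)
    swaps = 0
    changed = True
    while changed:
        changed = False
        for i in range(len(seq) - 1):
            if seq[i] > seq[i + 1]:
                seq[i], seq[i + 1] = seq[i + 1], seq[i]
                swaps += 1
                changed = True
    return swaps


def max_pairs(n):
    """Number of distinct unordered pairs among n items."""
    return n * (n - 1) // 2
```

A fully reversed list is the worst case: it needs exactly `max_pairs(n)` swaps, so a limit of n**2 can never be reached under the one-swap-per-pair assumption.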