Open LuDuerlich opened 3 years ago
Mmmh, this is strange. rroot is indeed used as a dummy dependency relation for the dummy root token; it should never be used for any other token and should never be printed. This is quite hard to debug if it's that infrequent :/ It probably won't help, but can you show me a sample of CoNLL-U output where this happens?
Here is some output for Basque:
# sent_id = dev-s1144
# text = Kroaziarraren kasuan, normaltzat jo behar da hori, orain artean oso gutxi jokatu baitu.
1 Kroaziarraren kroaziar NOUN _ Case=Gen|Definite=Def|Number=Sing 2 nmod _ _
2 kasuan kasu NOUN _ Animacy=Inan|Case=Ine|Definite=Def|Number=Sing 0 obl _ SpaceAfter=No
3 , , PUNCT _ _ 2 punct _ _
4 normaltzat normal ADJ _ Case=Ess|Definite=Ind 5 obl _ _
5 jo jo VERB _ VerbForm=Part 3 xcomp _ _
6 behar behar NOUN _ Case=Abs|Definite=Ind 7 compound _ _
7 da izan VERB _ Aspect=Prog|Mood=Ind|Number[abs]=Sing|Person[abs]=3 14 rroot _ _
8 hori hori DET _ Case=Abs|Definite=Def|Number=Sing 14 nsubj _ SpaceAfter=No
9 , , PUNCT _ _ 7 punct _ _
10 orain orain ADV _ Case=Ine 14 advmod _ _
11 artean arte ADP _ Case=Ine 10 case _ _
12 oso oso ADV _ _ 13 advmod _ _
13 gutxi gutxi ADV _ _ 14 obl _ _
14 jokatu jokatu VERB _ Aspect=Perf|VerbForm=Part 5 advcl _ _
15 baitu *edun AUX _ Mood=Ind|Number[abs]=Sing|Number[erg]=Sing|Person[abs]=3|Person[erg]=3 14 aux _ SpaceAfter=No
16 . . PUNCT _ _ 7 punct _ _
# sent_id = dev-s1366
# text = "Araudia ikusita, jendea orain baino lehenago irten beharko da etxetik anbientea sortzeko...".
1 " " PUNCT _ _ 0 punct _ SpaceAfter=No
2 Araudia araudi NOUN _ Animacy=Inan|Case=Abs|Definite=Def|Number=Sing 3 obj _ _
3 ikusita ikusi VERB _ VerbForm=Part 1 advcl _ SpaceAfter=No
4 , , PUNCT _ _ 3 punct _ _
5 jendea jende NOUN _ Case=Abs|Definite=Def|Number=Sing 9 nsubj _ _
6 orain orain ADV _ _ 7 advmod _ _
7 baino baino X _ _ 9 advmod _ _
8 lehenago lehenago ADV _ _ 9 advmod _ _
9 irten irten VERB _ VerbForm=Part 4 xcomp _ _
10 beharko behar_izan VERB _ _ 9 rroot _ _
11 da izan AUX _ Mood=Ind|Number[abs]=Sing|Person[abs]=3 10 aux _ _
12 etxetik etxe NOUN _ Animacy=Inan|Case=Abl|Definite=Def|Number=Sing 14 obl _ _
13 anbientea anbiente NOUN _ Case=Abs|Definite=Def|Number=Sing 14 obj _ _
14 sortzeko sortu VERB _ Case=Abs|Definite=Ind 10 advcl _ SpaceAfter=No
15 ... ... PUNCT _ _ 10 punct _ SpaceAfter=No
16 " " PUNCT _ _ 10 punct _ SpaceAfter=No
17 . . PUNCT _ _ 10 punct _ _
From what I could tell there are only about 4 sentences in the Basque dev set across all training epochs where rroot has been predicted, but per epoch, it gets predicted at most twice, so there is some variation.
And Hindi:
# sent_id = dev-s139
# text = लोकसभा में पेश की गई अपनी रिपोर्ट में कमेटी का कहना है कि रेलवे को केंद्रीय मदद अब ५० फीसदी से भी अधिक मिलने लगी है ।
1 लोकसभा लोकसभा NOUN NN Case=Acc|Gender=Fem|Number=Sing|Person=3 4 obl _ Vib=0_में|Tam=0|ChunkId=NP|ChunkType=head|Translit=lokasabhā
2 में में ADP PSP AdpType=Post 1 case _ ChunkId=NP|ChunkType=child|Translit=meṁ
3 पेश पेश ADJ JJ _ 4 compound _ ChunkId=JJP|ChunkType=head|Translit=peśa
4 की कर VERB VM Aspect=Perf|Gender=Fem|Number=Sing|VerbForm=Part 7 acl _ Vib=या_जा+या१|Tam=yA|ChunkId=VGNF|ChunkType=head|Translit=kī
5 गई जा AUX VAUX Aspect=Perf|Gender=Fem|Number=Sing|VerbForm=Part 4 aux:pass _ Vib=या१|Tam=yA1|ChunkId=VGNF|ChunkType=child|Translit=gaī
6 अपनी अपना PRON PRP Case=Acc|Gender=Fem|PronType=Prs 7 nmod _ Vib=0|Tam=0|ChunkId=NP2|ChunkType=head|Translit=apanī
7 रिपोर्ट रिपोर्ट NOUN NN Case=Acc|Gender=Fem|Number=Sing|Person=3 0 obl _ Vib=0_में|Tam=0|ChunkId=NP3|ChunkType=head|Translit=riporṭa
8 में में ADP PSP AdpType=Post 7 case _ ChunkId=NP3|ChunkType=child|Translit=meṁ
9 कमेटी कमेटी NOUN NN Case=Acc|Gender=Fem|Number=Sing|Person=3 11 nsubj _ Vib=0_का|Tam=0|ChunkId=NP4|ChunkType=head|Translit=kameṭī
10 का का ADP PSP AdpType=Post|Case=Nom|Gender=Masc|Number=Sing 9 case _ ChunkId=NP4|ChunkType=child|Translit=kā
11 कहना कह VERB VM Case=Nom|VerbForm=Inf 7 amod _ Vib=ना|Tam=nA|ChunkId=VGNN|ChunkType=head|Translit=kahanā
12 है है VERB VM Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act 24 rroot _ Vib=है|Tam=hE|ChunkId=VGF|ChunkType=head|Stype=declarative|Translit=hai
13 कि कि SCONJ CC _ 24 mark _ AltTag=SCONJ-CONJ|ChunkId=CCP|ChunkType=head|Translit=ki
14 रेलवे रेलवे NOUN NN Case=Acc|Gender=Masc|Number=Sing|Person=3 24 nsubj _ Vib=0_को|Tam=0|ChunkId=NP5|ChunkType=head|Translit=relave
15 को को ADP PSP AdpType=Post 14 case _ ChunkId=NP5|ChunkType=child|Translit=ko
16 केंद्रीय केंद्रीय ADJ JJ Case=Nom 17 compound _ ChunkId=NP6|ChunkType=child|Translit=keṁdrīya
17 मदद मदद NOUN NN Case=Nom|Gender=Fem|Number=Sing|Person=3 24 nsubj _ Vib=0|Tam=0|ChunkId=NP6|ChunkType=head|Translit=madada
18 अब अब PRON PRP Case=Nom|PronType=Prs 24 obl _ ChunkId=NP7|ChunkType=head|Translit=aba
19 ५० ५० NUM QC NumType=Card 20 nummod _ ChunkId=NP8|ChunkType=child|Translit=50
20 फीसदी फीसदी NOUN NN Case=Acc|Gender=Fem|Number=Sing|Person=3 24 obl _ Vib=0_से|Tam=0|ChunkId=NP8|ChunkType=head|Translit=phīsadī
21 से से ADP PSP AdpType=Post 20 case _ ChunkId=NP8|ChunkType=child|Translit=se
22 भी भी PART RP _ 20 dep _ ChunkId=NP8|ChunkType=child|Translit=bhī
23 अधिक अधिक DET QF PronType=Ind 24 nsubj _ AltTag=ADJ-DET|ChunkId=JJP2|ChunkType=head|Translit=adhika
24 मिलने मिल VERB VM Gender=Fem|Number=Sing|Person=3|VerbForm=Inf|Voice=Act 11 obj _ Vib=ना_लग+या_है|Tam=nA|ChunkId=VGF2|ChunkType=head|Stype=declarative|Translit=milane
25 लगी लग AUX VAUX Aspect=Perf|Gender=Fem|Number=Sing|VerbForm=Part 24 aux _ Vib=या|Tam=yA|ChunkId=VGF2|ChunkType=child|Translit=lagī
26 है है AUX VAUX Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 24 aux:pass _ Vib=है|Tam=hE|ChunkId=VGF2|ChunkType=child|Translit=hai
27 । । PUNCT SYM _ 12 punct _ ChunkId=BLK|ChunkType=head|Translit=.
# sent_id = dev-s177
# text = लेकिन हम लोगों का मानना है कि राष्ट्रपति, प्रधानमंत्री और मुख्य न्यायाधीश को कम से कम इससे बाहर होना चाहिए ।
1 लेकिन लेकिन CCONJ CC _ 0 cc _ ChunkId=CCP|ChunkType=head|Translit=lekina
2 हम हम DET DEM Case=Nom|Number=Plur|Person=1|PronType=Dem 3 det _ ChunkId=NP|ChunkType=child|Translit=hama
3 लोगों लोग NOUN NN Case=Acc|Gender=Masc|Number=Plur|Person=3 5 nsubj _ Vib=0_का|Tam=0|ChunkId=NP|ChunkType=head|Translit=logoṁ
4 का का ADP PSP AdpType=Post|Case=Nom|Gender=Masc|Number=Sing 3 case _ ChunkId=NP|ChunkType=child|Translit=kā
5 मानना मान VERB VM Case=Nom|VerbForm=Inf 1 mark _ Vib=ना|Tam=nA|ChunkId=VGNN|ChunkType=head|Translit=mānanā
6 है है VERB VM Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act 20 rroot _ Vib=है|Tam=hE|ChunkId=VGF|ChunkType=head|Stype=declarative|Translit=hai
7 कि कि SCONJ CC _ 20 mark _ AltTag=SCONJ-CONJ|ChunkId=CCP2|ChunkType=head|Translit=ki
8 राष्ट्रपति राष्ट्रपति PROPN NNP Case=Acc|Gender=Masc|Number=Sing|Person=3 20 nsubj _ SpaceAfter=No|Vib=0|Tam=0|ChunkId=NP2|ChunkType=head|Translit=rāṣṭrapati
9 , , PUNCT SYM _ 10 punct _ ChunkId=NP2|ChunkType=child|Translit=,
10 प्रधानमंत्री प्रधानमंत्री PROPN NNP Case=Acc|Gender=Masc|Number=Sing|Person=3 8 conj _ Vib=0|Tam=0|ChunkId=NP3|ChunkType=head|Translit=pradhānamaṁtrī
11 और और CCONJ CC _ 13 cc _ ChunkId=CCP3|ChunkType=head|Translit=aura
12 मुख्य मुख्य NOUN NNC Case=Nom|Gender=Masc|Number=Sing|Person=3 13 amod _ Vib=0|Tam=0|ChunkId=NP4|ChunkType=child|Translit=mukhya
13 न्यायाधीश न्यायाधीश NOUN NN Case=Acc|Gender=Masc|Number=Sing|Person=3 8 conj _ Vib=0_को|Tam=0|ChunkId=NP4|ChunkType=head|Translit=nyāyādhīśa
14 को को ADP PSP AdpType=Post 13 case _ ChunkId=NP4|ChunkType=child|Translit=ko
15 कम कम DET QF PronType=Ind 18 det _ ChunkId=NP5|ChunkType=child|Translit=kama
16 से से PART RP _ 15 dep _ ChunkId=NP5|ChunkType=child|Translit=se
17 कम कम DET QF PronType=Ind 18 det _ AltTag=ADJ-DET|ChunkId=NP5|ChunkType=head|Translit=kama
18 इससे यह PRON PRP Case=Acc,Ins|Number=Sing|Person=3|PronType=Prs 20 obl _ Vib=से|Tam=se|ChunkId=NP6|ChunkType=head|Translit=isase
19 बाहर बाहर ADV NST AdpType=Post|Case=Nom|Gender=Masc|Number=Sing|Person=3 18 case _ AltTag=ADV-NOUN|ChunkId=NP7|ChunkType=head|Translit=bāhara
20 होना हो VERB VM Gender=Masc|VerbForm=Inf|Voice=Act 5 obj _ Vib=ना_चाहिए|Tam=nA|ChunkId=VGF2|ChunkType=head|Stype=declarative|Translit=honā
21 चाहिए चाहिए AUX VAUX _ 20 aux _ Vib=0|Tam=0|ChunkId=VGF2|ChunkType=child|Translit=cāhie
22 । । PUNCT SYM _ 6 punct _ ChunkId=BLK|ChunkType=head|Translit=.
Here, there appear to be more instances. In some epochs, rroot gets predicted as many as 17 times.
Thanks! These two sentences are non-projective. My suspicion is that it might be due to max_swap in Predict in uuparser/arc_hybrid.py, which should not actually be necessary: I used it in the early debugging days but never went back to change it. Could you try setting max_swap to inf or len(sentence)*len(sentence)? In this line: https://github.com/UppsalaNLP/uuparser/blob/c0d8a8210c1032272dfad9250a765f09e128976f/uuparser/arc_hybrid.py#L287
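For anyone who wants to check the non-projectivity claim on other sentences, here is a minimal sketch that looks for crossing arcs in a list of heads. It is not uuparser code: `crossing_arcs` and `is_projective` are hypothetical helper names, and the sketch assumes the heads are given in token order with 0 standing for the dummy root.

```python
def crossing_arcs(heads):
    """Return every pair of arcs that cross each other.

    heads[i] is the HEAD of token i + 1 (tokens are 1-indexed, 0 is the root).
    Each arc is stored as (left_end, right_end), ignoring its direction.
    """
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    crossings = []
    for i, (a, b) in enumerate(arcs):
        for c, d in arcs[i + 1:]:
            # Two arcs cross when exactly one end of one arc lies
            # strictly inside the span of the other.
            if a < c < b < d or c < a < d < b:
                crossings.append(((a, b), (c, d)))
    return crossings


def is_projective(heads):
    """A tree is projective when none of its arcs cross."""
    return not crossing_arcs(heads)


# HEAD column of the predicted tree for dev-s1144 shown above.
basque_heads = [2, 0, 2, 5, 3, 7, 14, 14, 7, 14, 10, 13, 14, 5, 14, 7]
print(is_projective(basque_heads))
print(crossing_arcs(basque_heads)[:3])
```

For dev-s1144 the arc between tokens 5 and 14 crosses the arc between tokens 7 and 16, so the predicted tree is reported as non-projective.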
I tried both versions:
Ok, thanks! I still think it must have something to do with non-projectivity and the use of swap, but I have no idea what specifically at this point. I will take a look, but it probably won't be this week, sorry :/ In theory there should actually be no difference between len(sentence)**2 and inf, because any pair of words can be swapped at most once. So it probably has something to do with the conditions for swap in lines 174 to 182. There might be an edge case we did not cover?
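To make the counting argument concrete: if every pair of words is swapped at most once, a sentence of n words allows at most n*(n-1)/2 swaps, which is always below n**2. The sketch below only illustrates that bound with a plain bubble sort on the worst case (a fully reversed order); it is not the parser's transition system, and `count_adjacent_swaps` and `max_swaps` are names made up for this example.

```python
def count_adjacent_swaps(order):
    """Bubble-sort a copy of `order` and count the adjacent swaps.

    Every swap exchanges one out-of-order pair, and a pair that has been
    put in order is never exchanged again.
    """
    order = list(order)
    swaps = 0
    for end in range(len(order) - 1, 0, -1):
        for i in range(end):
            if order[i] > order[i + 1]:
                order[i], order[i + 1] = order[i + 1], order[i]
                swaps += 1
    return swaps


def max_swaps(n):
    """Number of distinct word pairs in a sentence of n words."""
    return n * (n - 1) // 2


n = 16  # length of the Basque sentence dev-s1144
worst_case = list(range(n, 0, -1))  # fully reversed order
print(count_adjacent_swaps(worst_case), max_swaps(n), n * n)
```

So a limit of len(sentence)**2 can never be reached if the one-swap-per-pair property holds, which is why it should behave like inf.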
I have been training parsers for multiple languages and observed a small number of instances where the parser predicts rroot instead of root on the dev set.
At first I thought this could be due to typos in the training data, but I could not find any instances in any of the UD treebanks (version 2.8). Instead, I found that rroot is introduced as part of a dummy root node in read_conll in utils.py. I suppose this is not really a typo in the code, but a dummy value that is meant to be overwritten by the parser and, in most cases, is.
The options I set were --dynet-mem 6000 --epochs 50 --k=2 --pos-emb-size 0 --char-emb-size 100 --disable-rlmost
and I observed it in some of the dev predictions starting at epoch 22 for Basque-BDT (random seed of 2) and in some of the predictions starting at the first epoch for Hindi-HDTB (random seed of 5).
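In case it helps with reproducing the counts, here is a rough sketch of how the affected tokens can be located in a predicted file. It is a standalone script, not part of uuparser; `find_label` is a hypothetical helper, and it assumes standard tab-separated CoNLL-U with ten columns, where DEPREL is the eighth.

```python
def find_label(conllu_text, label="rroot"):
    """Return (sent_id, token_id, form) for each token whose DEPREL is `label`."""
    hits = []
    sent_id = None
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id"):
            # Comment lines look like "# sent_id = dev-s1144".
            sent_id = line.split("=", 1)[1].strip()
            continue
        if not line or line.startswith("#"):
            continue  # blank line between sentences, or another comment
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        if cols[7] == label:  # column 8 is DEPREL
            hits.append((sent_id, cols[0], cols[1]))
    return hits


# Tiny made-up sample; for a real check, read the predicted dev file instead.
sample = (
    "# sent_id = s1\n"
    "1\ta\ta\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "2\tb\tb\tVERB\t_\t_\t1\trroot\t_\t_\n"
    "\n"
)
for sent_id, token_id, form in find_label(sample):
    print(sent_id, token_id, form)
```

Running this over the dev predictions of each epoch gives the per-epoch counts and the sentence IDs in one pass.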