EmilStenstrom / conllu

A CoNLL-U parser that takes a CoNLL-U formatted string and turns it into a nested python dictionary.
MIT License

Special handling of "0:root" labels in deps column #36

Closed: jbrry closed this issue 4 years ago

jbrry commented 4 years ago

I recently upgraded from conllu 1.3.1 to 2.2 because the newer version can handle elided tokens/copy nodes (e.g. token 8.1 below), which was addressed in https://github.com/EmilStenstrom/conllu/issues/27.

I am parsing the deps column with a loop that iterates over the deps tuples, putting the heads into a heads list and the relations into a relations list. The upgrade now includes the copy nodes, which is good, but every 0:root label is now returned as a plain string rather than as a tuple, which breaks my loop.

# sent_id = weblog-blogspot.com_healingiraq_20040409053012_ENG_20040409_053012-0022
# text = Over 300 Iraqis are reported dead and 500 wounded in Fallujah alone.
1   Over    over    ADV RB  _   2   advmod  2:advmod    _
2   300 300 NUM CD  NumType=Card    3   nummod  3:nummod    _
3   Iraqis  Iraqis  PROPN   NNPS    Number=Plur 5   nsubj:pass  5:nsubj:pass|6:nsubj:xsubj|8:nsubj:pass _
4   are be  AUX VBP Mood=Ind|Tense=Pres|VerbForm=Fin    5   aux:pass    5:aux:pass  _
5   reported    report  VERB    VBN Tense=Past|VerbForm=Part|Voice=Pass 0   root    0:root  _
6   dead    dead    ADJ JJ  Degree=Pos  5   xcomp   5:xcomp _
7   and and CCONJ   CC  _   8   cc  8:cc|8.1:cc _
8   500 500 NUM CD  NumType=Card    5   conj    5:conj:and|8.1:nsubj:pass|9:nsubj:xsubj _
8.1 reported    report  VERB    VBN Tense=Past|VerbForm=Part|Voice=Pass _   _   5:conj:and  CopyOf=5
9   wounded wounded ADJ JJ  Degree=Pos  8   orphan  8.1:xcomp   _
10  in  in  ADP IN  _   11  case    11:case _
11  Fallujah    Fallujah    PROPN   NNP Number=Sing 5   obl 5:obl:in    _
12  alone   alone   ADV RB  _   11  advmod  11:advmod   SpaceAfter=No
13  .   .   PUNCT   .   _   5   punct   5:punct _

Is this the desired behaviour? For example, the output of deps looks like:

deps [[('advmod', 2)], [('nummod', 3)], [('nsubj:pass', 5), ('nsubj:xsubj', 6), ('nsubj:pass', 8)], [('aux:pass', 5)], '0:root', [('xcomp', 5)], [('cc', 8), ('cc', (8, '.', 1))], [('conj:and', 5), ('nsubj:pass', (8, '.', 1)), ('nsubj:xsubj', 9)], [('conj:and', 5)], [('xcomp', (8, '.', 1))], [('case', 11)], [('obl:in', 5)], [('advmod', 11)], [('punct', 5)]]

Is there any particular reason why '0:root' shouldn't be [('root', 0)]?
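For illustration, here is a minimal sketch of the kind of loop described above, with a fallback that converts a raw string such as '0:root' into the tuple format. The names parse_head, normalize_deps and split_deps are hypothetical and not part of conllu; the sketch assumes that each deps value is either None, a raw string, or a list of (relation, head) tuples as in the output shown above.

```python
def parse_head(text):
    """Turn '5' into 5 and '8.1' into (8, '.', 1), mirroring the output above."""
    if "." in text:
        major, minor = text.split(".", 1)
        return (int(major), ".", int(minor))
    return int(text)


def normalize_deps(deps):
    """Return deps as a list of (relation, head) tuples (hypothetical helper)."""
    if deps is None or deps == "_":
        # Assumption: an empty deps field arrives as None or as a raw "_".
        return []
    if isinstance(deps, str):
        # Fallback for values left unparsed, such as '0:root'.
        pairs = []
        for part in deps.split("|"):
            # Split on the first colon only, so 'nsubj:pass' stays intact.
            head, _, relation = part.partition(":")
            pairs.append((relation, parse_head(head)))
        return pairs
    return list(deps)


def split_deps(deps_per_token):
    """Build parallel per-token lists of heads and relations."""
    heads, relations = [], []
    for deps in deps_per_token:
        pairs = normalize_deps(deps)
        heads.append([head for _, head in pairs])
        relations.append([relation for relation, _ in pairs])
    return heads, relations


sample = [[("advmod", 2)], "0:root", [("cc", 8), ("cc", (8, ".", 1))]]
heads, relations = split_deps(sample)
```

With this fallback the loop no longer depends on whether the parser returned a string or a list for the root token.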

Thanks!

jbrry commented 4 years ago

I was able to override this behaviour by changing the pattern defined at https://github.com/EmilStenstrom/conllu/blob/68199a7afcbf66660aec09bab5b0a2b995937dd6/conllu/parser.py#L173

to ID_SINGLE = re.compile(r"[0-9][0-9]*")

so that 0 is accepted as an ID. This enables a match here: https://github.com/EmilStenstrom/conllu/blob/68199a7afcbf66660aec09bab5b0a2b995937dd6/conllu/parser.py#L201

which allows the value to be returned in the tuple format.
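The effect of that change can be sketched with the standard re module alone. The STRICT pattern below is an assumption about what the original line looked like (a non-zero leading digit); only the RELAXED pattern is taken from this comment. The helper name accepts is hypothetical.

```python
import re

# Assumed shape of the original pattern: the first digit must be 1-9.
STRICT = re.compile(r"[1-9][0-9]*")
# Pattern proposed in this comment: the first digit may also be 0.
RELAXED = re.compile(r"[0-9][0-9]*")


def accepts(pattern, text):
    """Return True if the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None


strict_root = accepts(STRICT, "0")    # head ID of the root relation
relaxed_root = accepts(RELAXED, "0")
strict_five = accepts(STRICT, "5")
relaxed_padded = accepts(RELAXED, "007")
```

One trade-off to keep in mind: the relaxed pattern also accepts zero-padded values such as "007", and if the same pattern validates the ID column it would accept 0 there as well, so a separate pattern for head IDs may be a cleaner fix.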

It seems to have solved the problem but let me know if you would advise against it, thanks!

EmilStenstrom commented 4 years ago

@Jbar-ry Excellent catch, thanks for reporting it! It's definitely a bug. I'll think of a good way to solve it and release a new version soon!

jbrry commented 4 years ago

Thank you very much @EmilStenstrom!

EmilStenstrom commented 4 years ago

@Jbar-ry Thank you! I just released 2.2.1, which fixes this bug! Install it with pip install -U conllu.

jbrry commented 4 years ago

Thanks a lot @EmilStenstrom!