EmilStenstrom / conllu

A CoNLL-U parser that takes a CoNLL-U formatted string and turns it into a nested python dictionary.
MIT License
311 stars 50 forks source link

raised exception on invalid ID #65

Closed rahonalab closed 2 years ago

rahonalab commented 2 years ago

Hello, I know this is a malformed line, but apparently new versions of Stanza produces this output:

72  sa  sa  PRON    R   PronType=Prs|Reflex=Yes 76  expl:pv _   start_char=633673|end_char=633675
73-73   zaň _   _   _   _   _   _   _   start_char=633676|end_char=633679
73  zaň zaň PRON    PFms4   Animacy=Anim|Case=Acc|Gender=Masc|Number=Sing|Person=3|PronType=Prs 76  obj _

which raises this exception:

File "/usr/local/lib/python3.9/site-packages/conllu/__init__.py", line 40, in parse_tree
    return list(parse_tree_incr(StringIO(data)))
  File "/usr/local/lib/python3.9/site-packages/conllu/__init__.py", line 43, in parse_tree_incr
    for tokenlist in parse_incr(in_file):
  File "/usr/local/lib/python3.9/site-packages/conllu/__init__.py", line 32, in parse_incr
    yield parse_token_and_metadata(
  File "/usr/local/lib/python3.9/site-packages/conllu/parser.py", line 95, in parse_token_and_metadata
    tokens.append(parse_line(line, fields, field_parsers))
  File "/usr/local/lib/python3.9/site-packages/conllu/parser.py", line 132, in parse_line
    raise ParseException("Failed parsing field '{}': ".format(field) + str(e))
conllu.exceptions.ParseException: Failed parsing field 'id': '72-72' is not a valid ID.

Since I have to parse a lot of files, would it be possible to safely skip malformed lines instead of raising an exception i.e., terminating the script?

EmilStenstrom commented 2 years ago

@rahonalab Hi! Sorry for the late reply. The easiest way to solve this is to make a field_parser that just handles this case. Would that work for you?

rahonalab commented 2 years ago

Hello! Thank you and sorry also for the late reply. How can I make this? Thank you !

EmilStenstrom commented 2 years ago

It's right there in the docs! :) Does this make sense? https://github.com/EmilStenstrom/conllu#customizing-parsing-to-handle-strange-variations-of-conll-u=

rahonalab commented 2 years ago

Thank you for the quick reply. I have tried to specify the field_parsers argument as per the docs, but it doesn't work with the parse_tree() function. Beside this, how do you handle this case with a split function?

EmilStenstrom commented 2 years ago

No, if you're using parse_tree, this isn't going to work for you.

Ok, so thinking about this a bit more, I think this is a bug in this library. According to the spec ID:n can be a ranger, and I just assumed that they had to cover at least two tokens, because why else would you even need a separate line?

So let's fix this, and make sure these lines are returned like other range-id's are returned: (73, 73). You can then just ignore all ID:n that are tuples in your code. Does this sound like an OK solution for you?

rahonalab commented 2 years ago

thank you Emil, it's a good solution! the strange thing is that most of the time ranges are correctly parsed. For instance, this raises an exception:

# text = —Llévenme con Fernanda —alcanzó a decir.
# sent_id = 3598
1   —   —   PUNCT   fg  PunctType=Dash  7   punct   _   start_char=499721|end_char=499722
2-2 Llévenme    _   _   _   _   _   _   _   start_char=499722|end_char=499730
2   Llavenme    Llavenme    PROPN   np00000 _   7   nsubj   _   _
3   con con ADP sps00   _   4   case    _   start_char=499731|end_char=499734
4   Fernanda    Fernanda    PROPN   np00000 _   2   nmod    _   start_char=499735|end_char=499743
5   —alcanzó    —alcanzó    VERB    vmis3s0 Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin   7   ccomp   _   start_char=499744|end_char=499752
6   a   a   ADP sps00   _   7   advmod  _   start_char=499753|end_char=499754
7   decir   decir   VERB    vmn0000 VerbForm=Inf    0   root    _   start_char=499755|end_char=499760
8   .   .   PUNCT   fp  PunctType=Peri  7   punct   _   start_char=499760|end_char=499761

while this does not:

# sent_id = es-dev-001-s25
# text = Limita al norte y al este con la comuna de Cama, al sur con Grono, y al occidente con Verdabbio.
1   Limita  limita  VERB    _   VerbForm=Ger    0   root    _   _
2-3 al  _   _   _   _   _   _   _   _
2   a   a   ADP _   _   4   case    _   _
3   el  el  DET _   Definite=Def|Gender=Masc|Number=Sing|PronType=Art   4   det _   _
4   norte   norte   NOUN    _   Gender=Masc|Number=Sing 1   obl _   _
5   y   y   CCONJ   _   _   8   cc  _   _
6-7 al  _   _   _   _   _   _   _   _
6   a   a   ADP _   _   8   case    _   _
7   el  el  DET _   Definite=Def|Gender=Masc|Number=Sing|PronType=Art   8   det _   _
8   este    este    PRON    _   Gender=Masc|Number=Sing|PronType=Dem    4   conj    _   _
9   con con ADP _   _   11  case    _   _
10  la  el  DET _   Definite=Def|Gender=Fem|Number=Sing|PronType=Art    11  det _   _
11  comuna  comuna  NOUN    _   Gender=Fem|Number=Sing  1   obl _   _
12  de  de  ADP _   _   13  case    _   _
13  Cama    cama    PROPN   _   _   11  nmod    _   SpaceAfter=No
14  ,   ,   PUNCT   _   _   17  punct   _   _
15-16   al  _   _   _   _   _   _   _   _
15  a   a   ADP _   _   17  case    _   _
16  el  el  DET _   Definite=Def|Gender=Masc|Number=Sing|PronType=Art   17  det _   _
17  sur sur NOUN    _   Gender=Masc|Number=Sing 1   conj    _   _
18  con con ADP _   _   19  case    _   _
19  Grono   grono   PROPN   _   _   17  nmod    _   SpaceAfter=No
20  ,   ,   PUNCT   _   _   24  punct   _   _
21  y   y   CCONJ   _   _   24  cc  _   _
22-23   al  _   _   _   _   _   _   _   _
22  a   a   ADP _   _   24  case    _   _
23  el  el  DET _   Definite=Def|Gender=Masc|Number=Sing|PronType=Art   24  det _   _
24  occidente   occidente   NOUN    _   Gender=Masc|Number=Sing 17  conj    _   _
25  con con ADP _   _   26  case    _   _
26  Verdabbio   verdabbio   PROPN   _   _   24  nmod    _   SpaceAfter=No
27  .   .   PUNCT   _   _   1   punct   _   _

Probably it doesn't like that the numbers in the range are the same, what do you think?

EmilStenstrom commented 2 years ago

Hi! Yes, the code intentionally didn't allow ranges with the same number as from and to.

Anyways. I've now released version 4.5.2 of conllu that allows these kinds of IDs. Could you try installing the latest version to see if this works for you?

rahonalab commented 2 years ago

yes, it works! could you please point out the code you have modified? just curious :-)

EmilStenstrom commented 2 years ago

Here's a diff of the changes. In the code, it was literally one character! :)