Closed rahonalab closed 2 years ago
@rahonalab Hi! Sorry for the late reply. The easiest way to solve this is to make a field_parser that just handles this case. Would that work for you?
Hello! Thank you and sorry also for the late reply. How can I make this? Thank you !
It's right there in the docs! :) Does this make sense? https://github.com/EmilStenstrom/conllu#customizing-parsing-to-handle-strange-variations-of-conll-u=
Thank you for the quick reply. I have tried to specify the field_parsers argument as per the docs, but it doesn't work with the parse_tree() function. Beside this, how do you handle this case with a split function?
No, if you're using parse_tree, this isn't going to work for you.
Ok, so thinking about this a bit more, I think this is a bug in this library. According to the spec ID:n can be a ranger, and I just assumed that they had to cover at least two tokens, because why else would you even need a separate line?
So let's fix this, and make sure these lines are returned like other range-id's are returned: (73, 73). You can then just ignore all ID:n that are tuples in your code. Does this sound like an OK solution for you?
thank you Emil, it's a good solution! the strange thing is that most of the time ranges are correctly parsed. For instance, this raises an exception:
# text = —Llévenme con Fernanda —alcanzó a decir.
# sent_id = 3598
1 — — PUNCT fg PunctType=Dash 7 punct _ start_char=499721|end_char=499722
2-2 Llévenme _ _ _ _ _ _ _ start_char=499722|end_char=499730
2 Llavenme Llavenme PROPN np00000 _ 7 nsubj _ _
3 con con ADP sps00 _ 4 case _ start_char=499731|end_char=499734
4 Fernanda Fernanda PROPN np00000 _ 2 nmod _ start_char=499735|end_char=499743
5 —alcanzó —alcanzó VERB vmis3s0 Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 7 ccomp _ start_char=499744|end_char=499752
6 a a ADP sps00 _ 7 advmod _ start_char=499753|end_char=499754
7 decir decir VERB vmn0000 VerbForm=Inf 0 root _ start_char=499755|end_char=499760
8 . . PUNCT fp PunctType=Peri 7 punct _ start_char=499760|end_char=499761
while this does not:
# sent_id = es-dev-001-s25
# text = Limita al norte y al este con la comuna de Cama, al sur con Grono, y al occidente con Verdabbio.
1 Limita limita VERB _ VerbForm=Ger 0 root _ _
2-3 al _ _ _ _ _ _ _ _
2 a a ADP _ _ 4 case _ _
3 el el DET _ Definite=Def|Gender=Masc|Number=Sing|PronType=Art 4 det _ _
4 norte norte NOUN _ Gender=Masc|Number=Sing 1 obl _ _
5 y y CCONJ _ _ 8 cc _ _
6-7 al _ _ _ _ _ _ _ _
6 a a ADP _ _ 8 case _ _
7 el el DET _ Definite=Def|Gender=Masc|Number=Sing|PronType=Art 8 det _ _
8 este este PRON _ Gender=Masc|Number=Sing|PronType=Dem 4 conj _ _
9 con con ADP _ _ 11 case _ _
10 la el DET _ Definite=Def|Gender=Fem|Number=Sing|PronType=Art 11 det _ _
11 comuna comuna NOUN _ Gender=Fem|Number=Sing 1 obl _ _
12 de de ADP _ _ 13 case _ _
13 Cama cama PROPN _ _ 11 nmod _ SpaceAfter=No
14 , , PUNCT _ _ 17 punct _ _
15-16 al _ _ _ _ _ _ _ _
15 a a ADP _ _ 17 case _ _
16 el el DET _ Definite=Def|Gender=Masc|Number=Sing|PronType=Art 17 det _ _
17 sur sur NOUN _ Gender=Masc|Number=Sing 1 conj _ _
18 con con ADP _ _ 19 case _ _
19 Grono grono PROPN _ _ 17 nmod _ SpaceAfter=No
20 , , PUNCT _ _ 24 punct _ _
21 y y CCONJ _ _ 24 cc _ _
22-23 al _ _ _ _ _ _ _ _
22 a a ADP _ _ 24 case _ _
23 el el DET _ Definite=Def|Gender=Masc|Number=Sing|PronType=Art 24 det _ _
24 occidente occidente NOUN _ Gender=Masc|Number=Sing 17 conj _ _
25 con con ADP _ _ 26 case _ _
26 Verdabbio verdabbio PROPN _ _ 24 nmod _ SpaceAfter=No
27 . . PUNCT _ _ 1 punct _ _
Probably it doesn't like that the numbers in the range are the same, what do you think?
Hi! Yes, the code intentionally didn't allow ranges with the same number as from and to.
Anyways. I've now released version 4.5.2 of conllu that allows these kinds of IDs. Could you try installing the latest version to see if this works for you?
yes, it works! could you please point out the code you have modified? just curious :-)
Here's a diff of the changes. In the code, it was literally one character! :)
Hello, I know this is a malformed line, but apparently new versions of Stanza produces this output:
which raises this exception:
Since I have to parse a lot of files, would it be possible to safely skip malformed lines instead of raising an exception i.e., terminating the script?