Closed rahonalab closed 2 years ago
Great catch, this is something we should fix. Are you able to write up a PR?
@rahonalab Do you have some sample CoNLL-U that corresponds to this sentence I can test with?
Hello and thanks for your reply! Here's a sample containing the tokens:
I finally got around to looking at this. The first thing I did was write a test for something I didn't think worked, but parsing seems to work just fine. Could you send over a test that fails so I can troubleshoot further?
import unittest
from textwrap import dedent

from conllu import parse


class TestParseBracketsInToken(unittest.TestCase):
    def test_bracketsintoken(self):
        # Columns are tab-separated, as in CoNLL-U; backslashes are escaped
        # so that SpacesAfter keeps its literal \s\s value.
        data = dedent("""\
            # sent_id = 896
            # text = Øvelse nr. 1 Antag at a og b er to reelle tal og at 0<a<b.
            1	Øvelse	øvelse	NOUN	_	Definite=Ind|Gender=Com|Number=Sing	2	nmod	_	_
            2	nr.	nummer	NOUN	_	Definite=Ind|Gender=Neut|Number=Sing	0	root	_	_
            3	1	1	NUM	_	NumType=Card	2	nmod	_	SpacesAfter=\\s\\s
            4	Antag	Antag	ADP	_	AdpType=Prep	12	case	_	_
            5	at	at	SCONJ	_	_	12	mark	_	_
            6	a	a	NOUN	_	Definite=Ind|Gender=Com|Number=Sing	12	nsubj	_	_
            7	og	og	CCONJ	_	_	8	cc	_	_
            8	b	b	NOUN	_	Definite=Ind|Gender=Com|Number=Sing	6	conj	_	_
            9	er	være	AUX	_	Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act	12	cop	_	_
            10	to	to	NUM	_	NumType=Card	12	nummod	_	_
            11	reelle	reel	ADJ	_	Degree=Pos|Number=Plur	12	amod	_	_
            12	tal	tal	NOUN	_	Definite=Ind|Gender=Neut|Number=Plur	2	nmod	_	_
            13	og	og	CCONJ	_	_	15	cc	_	_
            14	at	at	X	_	_	15	amod	_	_
            15	0<a<b	0<a<be	NOUN	_	Definite=Ind|Gender=Neut|Number=Plur	12	conj	_	SpaceAfter=No
            16	.	.	PUNCT	_	_	2	punct	_	_
        """)
        sentences = parse(data)
        # Token 15 (index 14) must keep its angle brackets intact.
        self.assertEqual(sentences[0][14]["form"], "0<a<b")
There are some issues with this TypeError, as it mistakes a token containing < or > for a tag. For instance, the Danish sentence "Antag at a og b er to reelle tal og at 0<a<b." raises the TypeError.
Moreover, what about actual tags like <p> that can be found quoted in several texts?
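To make the distinction concrete, here is a minimal sketch of one way to tell a token that merely contains angle brackets from a standalone markup line. It is only an illustration under stated assumptions: `is_token_line` and `is_markup_line` are hypothetical helpers, not part of conllu, and the `<p>` token line is made-up sample data. The idea is to decide by line shape (ten tab-separated columns with a valid ID) rather than by the presence of < or >.

```python
import re

# Hypothetical helpers, not part of any library. Assumption: a token line
# has ten tab-separated columns and an integer, range (1-2) or decimal (1.1) ID.
TOKEN_ID = re.compile(r"^\d+([-.]\d+)?$")
MARKUP = re.compile(r"</?[A-Za-z][^<>]*>")


def is_token_line(line: str) -> bool:
    """True if the line has the shape of a CoNLL-U token line."""
    columns = line.rstrip("\n").split("\t")
    return len(columns) == 10 and TOKEN_ID.match(columns[0]) is not None


def is_markup_line(line: str) -> bool:
    """True only for a standalone line such as <p> or </doc>."""
    if is_token_line(line):
        return False
    return MARKUP.fullmatch(line.strip()) is not None


# Token 15 from the sentence above, plus an invented token quoting a tag.
token = "15\t0<a<b\t0<a<b\tNOUN\t_\t_\t12\tconj\t_\tSpaceAfter=No"
quoted_tag = "3\t<p>\t<p>\tSYM\t_\t_\t2\tdep\t_\t_"

for line in (token, quoted_tag, "<p>"):
    print(repr(line), is_token_line(line), is_markup_line(line))
```

With this check, both 0<a<b and a quoted <p> inside a token line are treated as ordinary forms, and only a line consisting of nothing but a tag is classified as markup.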