EmilStenstrom / conllu

A CoNLL-U parser that takes a CoNLL-U formatted string and turns it into a nested Python dictionary.
MIT License

TypeError: initial_value must be str or None, not Tag #60

Closed: rahonalab closed this issue 2 years ago

rahonalab commented 2 years ago

There seems to be a problem behind this TypeError: the parser apparently mistakes a token containing < or > for a tag. For instance, the Danish sentence:

Antag at a og b er to reelle tal og at 0<a<b.

raises the TypeError.

Moreover, what about actual tags like <p>, which appear quoted in several texts?

EmilStenstrom commented 2 years ago

Great catch, this is something we should fix. Are you able to write up a PR?

EmilStenstrom commented 2 years ago

@rahonalab Do you have some sample CoNLL-U that corresponds to this sentence I can test with?

rahonalab commented 2 years ago

Hello and thanks for your reply! Here's a sample containing the tokens:

danish-ex.conlllu.zip

EmilStenstrom commented 2 years ago

I finally got around to looking at this. The first thing I did was write a test for something I didn't think would work, but parsing seems to work just fine. Could you send over a test that fails so I can troubleshoot further?

import unittest
from textwrap import dedent

from conllu import parse


class TestParseBracketsInToken(unittest.TestCase):
    def test_bracketsintoken(self):
        # Columns are separated by two or more spaces, which conllu accepts
        # in place of tabs. The backslashes in SpacesAfter are escaped so the
        # string literal contains a literal \s\s.
        data = dedent("""\
            # sent_id = 896
            # text = Øvelse nr. 1 Antag at a og b er to reelle tal og at 0<a<b.
            1   Øvelse  øvelse  NOUN    _   Definite=Ind|Gender=Com|Number=Sing  2   nmod    _   _
            2   nr.  nummer  NOUN    _   Definite=Ind|Gender=Neut|Number=Sing    0   root    _   _
            3   1   1   NUM  _   NumType=Card    2   nmod    _   SpacesAfter=\\s\\s
            4   Antag   Antag   ADP  _   AdpType=Prep    12  case    _   _
            5   at  at  SCONJ   _   _   12  mark    _   _
            6   a   a   NOUN    _   Definite=Ind|Gender=Com|Number=Sing  12  nsubj   _   _
            7   og  og  CCONJ   _   _   8   cc  _   _
            8   b   b   NOUN    _   Definite=Ind|Gender=Com|Number=Sing  6   conj    _   _
            9   er  være    AUX  _   Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act  12  cop  _   _
            10  to  to  NUM  _   NumType=Card    12  nummod  _   _
            11  reelle  reel    ADJ  _   Degree=Pos|Number=Plur  12  amod    _   _
            12  tal  tal  NOUN    _   Definite=Ind|Gender=Neut|Number=Plur    2   nmod    _   _
            13  og  og  CCONJ   _   _   15  cc  _   _
            14  at  at  X   _   _   15  amod    _   _
            15  0<a<b   0<a<be  NOUN    _   Definite=Ind|Gender=Neut|Number=Plur    12  conj    _   SpaceAfter=No
            16  .   .   PUNCT   _   _   2   punct   _   _

        """)

        sentences = parse(data)
        # Token 15 (index 14) is the one containing the < characters.
        self.assertEqual(sentences[0][14]["form"], "0<a<b")
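
Since the test above passes, one possible explanation (not confirmed anywhere in this thread) is that the failing call passed a non-string object to the parser, not a str. The wording of the error matches what Python's io.StringIO raises for a non-str initial value, and the sketch below reproduces that message with the standard library alone. The Tag class and the reproduce helper are hypothetical stand-ins invented for illustration. That conllu wraps its input in StringIO is an assumption here.

```python
import io


class Tag:
    """Hypothetical stand-in for a non-str object, e.g. an HTML parser node."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text


def reproduce(value):
    """Return the TypeError message io.StringIO raises for value, or None."""
    try:
        io.StringIO(value)
    except TypeError as exc:
        return str(exc)
    return None


# Passing the object itself fails, because StringIO only accepts str or None.
message = reproduce(Tag("0<a<b"))
print(message)

# Converting to str first succeeds, even though the text contains < characters.
converted = reproduce(str(Tag("0<a<b")))
print(converted)
```

If this is what happened, calling str() on the object before parsing would avoid the error; a failing test from the reporter would still be needed to confirm it.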