UniversalDependencies / UD_Finnish-TDT

Invalid word indices #5

Closed by bheinzerling 5 years ago

bheinzerling commented 5 years ago

The CoNLL-U parser I'm using complains about invalid IDs when trying to read fi_tdt-ud-train.conllu.

It looks like the following 5 word indices do not conform to the CoNLL-U format, since word indices should start with 1:

fi_tdt-ud-train.conllu:15617:0.1        Laitoin laittaa VERB    _       Mood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin|Voice=Act _       _       _       _
fi_tdt-ud-train.conllu:16633:0.1        Laita   laittaa VERB    _       Mood=Imp|Number=Sing|Person=2|VerbForm=Fin|Voice=Act    _       _       _       _
fi_tdt-ud-train.conllu:43686:0.1        Miettii miettiä VERB    _       Mood=Ind|Number=Sing|Person=0|Tense=Pres|VerbForm=Fin|Voice=Act _       _       _       _
fi_tdt-ud-train.conllu:59571:0.1        Otetaan ottaa   VERB    _       Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Pass     _       _       _       _
fi_tdt-ud-train.conllu:59690:0.1        Harjasi harjata VERB    _       Mood=Ind|Number=Sing|Person=0|Tense=Past|VerbForm=Fin|Voice=Act _       _       _       _

(grep output with line numbers)

dan-zeman commented 5 years ago

This is actually an imperfection in the documentation of the format. The "start with 1" rule applies to the integer IDs of real words, but not to the decimal IDs of empty nodes. Thus ID=0.1 is OK, while ID=0 would be an error.

Later in the format documentation, it says: "It is possible to insert one or more empty nodes indexed i.1, i.2, etc. immediately after a word with index i (where i = 0 for sentence-initial empty nodes)."
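
For parser authors hitting the same error, the rule above can be sketched as a small ID check. This is only an illustration, not the official validator: the names `WORD_ID`, `EMPTY_ID`, and `classify_id` are hypothetical, and the sketch assumes the decimal part of an empty-node ID starts at 1, as the "i.1, i.2" wording suggests. Multiword token ranges such as 1-2 are out of scope here and fall into the "other" bucket.

```python
import re

# Hypothetical helper names; the patterns encode the rule described above.
WORD_ID = re.compile(r"[1-9][0-9]*")                      # real words: 1, 2, 3, ...
EMPTY_ID = re.compile(r"(0|[1-9][0-9]*)\.([1-9][0-9]*)")  # empty nodes: i.1, i.2, ... with i >= 0


def classify_id(token_id: str) -> str:
    """Classify the ID column of a CoNLL-U token line.

    Returns "word" for integer IDs starting at 1, "empty" for decimal
    empty-node IDs (including sentence-initial ones such as 0.1), and
    "other" for everything else (ranges are not handled in this sketch).
    """
    if WORD_ID.fullmatch(token_id):
        return "word"
    if EMPTY_ID.fullmatch(token_id):
        return "empty"
    return "other"


if __name__ == "__main__":
    for candidate in ["1", "0.1", "0", "3.2"]:
        print(candidate, classify_id(candidate))
```

With this check, the five IDs reported in this issue (all 0.1) are classified as empty nodes, while a bare 0 is rejected.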

I think there should be a short warning about the exception at the beginning of the document. I will look into it.

dan-zeman commented 5 years ago

Fixed in https://github.com/UniversalDependencies/docs/commit/c2b6b095f42c8f3c5ed298e09eaa315539061fde