Closed KoichiYasuoka closed 3 years ago
This should be an easy fix. I will update the models soon.
This (along with some other tokenization issues) is now fixed with the latest model:
In [4]: cube("Δεν υπάρχει βασιλικός δρόμος.")
Out[4]:
1 Δεν δεν PART PART _ 2 advmod _ _
2 υπάρχει υπάρχω VERB VERB Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act 0 root _ _
3 βασιλικός βασιλικός ADJ ADJ Case=Nom|Gender=Masc|Number=Sing 4 amod _ _
4 δρόμος δρόμος NOUN NOUN Case=Nom|Gender=Masc|Number=Sing 2 nsubj _ _
5 . . PUNCT PUNCT _ 2 punct _ _
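The fix can be sanity-checked without NLP-Cube: in CoNLL-U, a multiword token shows up as a range ID such as `1-2`, so the output above is correct exactly when no ID column contains a hyphen. Below is a minimal, dependency-free sketch of that check (the tab-separated rows are transcribed from the output above, with FEATS abbreviated to `_` for brevity):

```python
# Sketch: confirm that no token in the parsed sentence is a multiword
# range (an ID like "1-2"), i.e. neither "Δεν" nor "." is split.
rows = [
    "1\tΔεν\tδεν\tPART\tPART\t_\t2\tadvmod\t_\t_",
    "2\tυπάρχει\tυπάρχω\tVERB\tVERB\t_\t0\troot\t_\t_",
    "3\tβασιλικός\tβασιλικός\tADJ\tADJ\t_\t4\tamod\t_\t_",
    "4\tδρόμος\tδρόμος\tNOUN\tNOUN\t_\t2\tnsubj\t_\t_",
    "5\t.\t.\tPUNCT\tPUNCT\t_\t2\tpunct\t_\t_",
]

def multiword_ids(lines):
    """Return the ID fields that denote multiword ranges (e.g. '1-2')."""
    return [line.split("\t")[0] for line in lines if "-" in line.split("\t")[0]]

print(multiword_ids(rows))  # [] -> no token is treated as multiword
```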
The dataset is relatively small, so there will still be many cases where tokenization fails.
I've confirmed that the Greek model now works well. Thank you, @tiberiu44, for the quick response.
Describe the bug
The Greek model's multiword segmentation sometimes incorrectly splits punctuation as a multiword token.
To Reproduce
Expected behavior
Neither "." nor "Δεν" is treated as a multiword token.