Closed KoichiYasuoka closed 3 years ago
Thank you for letting us know.
I've updated the tokenization and parsing models. This issue is now fixed:
In [4]: cube("Тәннен жан артық еді.")
Out[4]:
1 Тәннен тән PRON prn Case=Dat|Number=Sing|Person=1|PronType=Prs 3 obl _ _
2 жан жан NOUN n Case=Nom 3 nsubj _ _
3 артық артық ADJ adj _ 0 root _ _
4 еді е AUX cop Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 3 cop _ _
5 . у. PUNCT sent _ 3 punct _ _
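For anyone who wants to verify the output above mechanically, here is a small standalone sketch that parses the CoNLL-U rows (tab-separated, standard ten-column layout) and confirms the sentence is now split into the five expected tokens with the adjective as root. The sample data is copied from the output above; nothing here touches the model itself.

```python
# CoNLL-U rows exactly as reported above (columns tab-separated:
# ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC).
SAMPLE = (
    "1\tТәннен\tтән\tPRON\tprn\tCase=Dat|Number=Sing|Person=1|PronType=Prs\t3\tobl\t_\t_\n"
    "2\tжан\tжан\tNOUN\tn\tCase=Nom\t3\tnsubj\t_\t_\n"
    "3\tартық\tартық\tADJ\tadj\t_\t0\troot\t_\t_\n"
    "4\tеді\tе\tAUX\tcop\tMood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin\t3\tcop\t_\t_\n"
    "5\t.\tу.\tPUNCT\tsent\t_\t3\tpunct\t_\t_\n"
)

def parse_conllu(block: str) -> list[list[str]]:
    """Split each non-empty line into its CoNLL-U columns."""
    return [line.split("\t") for line in block.splitlines() if line.strip()]

rows = parse_conllu(SAMPLE)
forms = [row[1] for row in rows]                    # FORM column
root = next(row for row in rows if row[7] == "root")  # DEPREL column

assert forms == ["Тәннен", "жан", "артық", "еді", "."]
assert root[1] == "артық" and root[3] == "ADJ"
```

As the comment thread notes, the lemmas (e.g. "у." for the period) are still questionable even though the token boundaries are now correct.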
Lemmatization is still problematic, but the dataset is small and I don't know how much it can improve. Still, I will try retraining the lemmatizer as well.
Thank you @tiberiu44 for the improvement. The Stanza developers are also concerned about Kazakh (https://github.com/stanfordnlp/stanza/issues/723), so please give them hints if you have time.
Describe the bug: Tokenization of the Kazakh model does not work well.
To Reproduce
Expected behavior: The sentence should be tokenized into five tokens: "Тәннен", "жан", "артық", "еді", and the final period.