adobe / NLP-Cube

Natural Language Processing Pipeline - Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging and Dependency Parsing
http://opensource.adobe.com/NLP-Cube/index.html
Apache License 2.0
550 stars 93 forks

Kazakh model of nlpcube 0.3.1.0 does not tokenize well #131

Closed KoichiYasuoka closed 3 years ago

KoichiYasuoka commented 3 years ago

Describe the bug: Tokenization with the Kazakh model does not work well.

To Reproduce

>>> from cube.api import Cube
>>> nlp=Cube()
>>> nlp.load("kk")
>>> doc=nlp("Тәннен жан артық еді.")
>>> print(doc)
1   Тәнненжанартық  тәнненжанартық  ADJ adj _   0   root    _   _
2   еді.    ед  PUNCT   sent    _   1   cop _   _

Expected behavior: The sentence should be tokenized into five lines: "Тәннен", "жан", "артық", "еді", and the final period.
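As a quick sanity check, the mis-tokenization can be seen by extracting the FORM column from the CoNLL-U output that `print(doc)` produces. This is a generic standard-library sketch, not part of NLP-Cube's API; the `tokens_from_conllu` helper is hypothetical:

```python
def tokens_from_conllu(text):
    """Extract the FORM column (column 2) from CoNLL-U-formatted lines."""
    forms = []
    for line in text.strip().splitlines():
        cols = line.split("\t")
        if len(cols) >= 2 and cols[0].isdigit():
            forms.append(cols[1])
    return forms

expected = ["Тәннен", "жан", "артық", "еді", "."]

# Output of the buggy 0.3.1.0 Kazakh model: two tokens instead of five
buggy = (
    "1\tТәнненжанартық\tтәнненжанартық\tADJ\tadj\t_\t0\troot\t_\t_\n"
    "2\tеді.\tед\tPUNCT\tsent\t_\t1\tcop\t_\t_"
)

print(tokens_from_conllu(buggy))              # ['Тәнненжанартық', 'еді.']
print(tokens_from_conllu(buggy) == expected)  # False
```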

tiberiu44 commented 3 years ago

Thank you for letting us know.

I've updated the tokenization and parsing models. This issue is now fixed:

In [4]: cube("Тәннен жан артық еді.")
Out[4]:
1   Тәннен  тән PRON    prn Case=Dat|Number=Sing|Person=1|PronType=Prs  3   obl _   _
2   жан жан NOUN    n   Case=Nom    3   nsubj   _   _
3   артық   артық   ADJ adj _   0   root    _   _
4   еді е   AUX cop Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin   3   cop _   _
5   .   у.  PUNCT   sent    _   3   punct   _   _

Lemmatization is still problematic, but the dataset is small and I'm not sure how much it can improve. Still, I will try retraining the lemmatizer as well.
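The remaining lemmatization problem is visible in the LEMMA column (column 3) of the corrected output above, where the final period is lemmatized as "у.". A small sketch, again independent of NLP-Cube, with a hypothetical `lemmas_from_conllu` helper:

```python
def lemmas_from_conllu(text):
    """Extract (FORM, LEMMA) pairs from CoNLL-U-formatted lines."""
    pairs = []
    for line in text.strip().splitlines():
        cols = line.split("\t")
        if len(cols) >= 3 and cols[0].isdigit():
            pairs.append((cols[1], cols[2]))
    return pairs

# The corrected output of the updated Kazakh model, as posted above
fixed = (
    "1\tТәннен\tтән\tPRON\tprn\tCase=Dat|Number=Sing|Person=1|PronType=Prs\t3\tobl\t_\t_\n"
    "2\tжан\tжан\tNOUN\tn\tCase=Nom\t3\tnsubj\t_\t_\n"
    "3\tартық\tартық\tADJ\tadj\t_\t0\troot\t_\t_\n"
    "4\tеді\tе\tAUX\tcop\tMood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin\t3\tcop\t_\t_\n"
    "5\t.\tу.\tPUNCT\tsent\t_\t3\tpunct\t_\t_"
)

for form, lemma in lemmas_from_conllu(fixed):
    print(form, "->", lemma)
# The last pair, ('.', 'у.'), is one of the remaining lemmatizer errors.
```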

KoichiYasuoka commented 3 years ago

Thank you @tiberiu44 for the improvement. The Stanza team is also having trouble with Kazakh (https://github.com/stanfordnlp/stanza/issues/723), so please give them hints if you have time.