adobe / NLP-Cube

Natural Language Processing Pipeline - Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging and Dependency Parsing
http://opensource.adobe.com/NLP-Cube/index.html
Apache License 2.0

Kazakh language wrong result #124

Closed pas-valkov closed 3 years ago

pas-valkov commented 3 years ago

I ran the example from the README for Kazakh. It ran without errors, but the result is wrong. English works fine.

Expected Result

I expected a list of words with their correct normalized forms (lemmas), but for every word the normalized form consists of only one letter.

Actual Result

[[1 Алтай а NOUN adj Case=Gen 2 nmod:poss ,
  2 жерінің ж NOUN n 3 obl ,
  3 асты а VERB adj 4 nsubj ,
  4 қандай қ PRON adv 5 nsubj ,
  5 қазыналы қ VERB n 0 root ,
  6 болса б VERB v Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 5 cop SpaceAfter=No,
  7 . . PUNCT sent 5 punct ],
 [1 Ағаш а VERB v Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 0 root SpaceAfter=No,
  2 . . PUNCT sent 1 punct _ SpaceAfter=No]]
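To make the symptom easy to check, here is a minimal sketch that flags lemmas that are just the lowercased first letter of a longer word. It works on plain (word, lemma) pairs copied by hand from the output above. The helper name `truncated_lemmas` is hypothetical and is not part of NLP-Cube; how to pull the pairs out of Cube's result objects is not shown because it may differ between versions.

```python
def truncated_lemmas(pairs):
    """Return the (word, lemma) pairs whose lemma is only the
    lowercased first letter of a word longer than one character."""
    return [
        (word, lemma)
        for word, lemma in pairs
        if len(word) > 1 and len(lemma) == 1 and lemma == word[0].lower()
    ]


# (word, lemma) pairs copied by hand from the output in this report
pairs = [
    ("Алтай", "а"),
    ("жерінің", "ж"),
    ("асты", "а"),
    ("қандай", "қ"),
    ("қазыналы", "қ"),
    ("болса", "б"),
    (".", "."),
    ("Ағаш", "а"),
    (".", "."),
]

suspicious = truncated_lemmas(pairs)
print(len(suspicious), "of", len(pairs), "lemmas look truncated")
```

Punctuation is skipped on purpose, since a one-character lemma is correct for a one-character token.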

Reproduction Steps

from cube.api import Cube  # import the Cube object
cube = Cube(verbose=True)  # initialize it
cube.load("kk")  # select the desired language (it will auto-download the model on first run)
text = "Алтай жерінің асты қандай қазыналы болса. Ағаш."
sentences = cube(text)  # call with your own text (string) to obtain the annotations
sentences

System Information

tiberiu44 commented 3 years ago

Hi @pas-valkov - I'm really sorry for the late response. I'm updating the models for 3.0 right now and hopefully it will fix the issue. Sorry again, I don't know how I missed this issue. I will let you know as soon as it's fixed.

tiberiu44 commented 3 years ago

I've just uploaded the updated model. Keep in mind that the Kazakh treebank in UD is really small, so the system will not have high accuracy.

pas-valkov commented 3 years ago

Thanks for your reply! Better late than never :) This issue was fixed by your new model, thanks!