adobe / NLP-Cube

Natural Language Processing Pipeline - Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging and Dependency Parsing
http://opensource.adobe.com/NLP-Cube/index.html
Apache License 2.0

Kazakh language wrong result #125

Closed pas-valkov closed 3 years ago

pas-valkov commented 3 years ago

I ran the example from the README for Kazakh. It completes without errors, but the result is wrong. The same example works fine for English.

Expected Result

I expected a list of words with their correct normalized forms (lemmas), but for every word the normalized form consists of only one letter.

Actual Result

[[1 Алтай а NOUN adj Case=Gen 2 nmod:poss ,
  2 жерінің ж NOUN n 3 obl ,
  3 асты а VERB adj 4 nsubj ,
  4 қандай қ PRON adv 5 nsubj ,
  5 қазыналы қ VERB n 0 root ,
  6 болса б VERB v Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 5 cop SpaceAfter=No,
  7 . . PUNCT sent 5 punct ],
 [1 Ағаш а VERB v Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 0 root SpaceAfter=No,
  2 . . PUNCT sent 1 punct _ SpaceAfter=No]]
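The symptom in the output above can be checked mechanically. Below is a minimal sketch, not part of NLP-Cube: the helper name `truncated_lemmas` is hypothetical, and the (word, lemma) pairs are copied by hand from the output rather than read from the library. It flags every token longer than one character whose lemma is just the lowercased first character of the word.

```python
# (word, lemma) pairs copied by hand from the output above.
pairs = [
    ("Алтай", "а"), ("жерінің", "ж"), ("асты", "а"), ("қандай", "қ"),
    ("қазыналы", "қ"), ("болса", "б"), (".", "."),
    ("Ағаш", "а"), (".", "."),
]


def truncated_lemmas(pairs):
    """Return the pairs whose lemma is only the lowercased first
    character of a word that is longer than one character.

    Hypothetical helper for illustration; not an NLP-Cube function.
    """
    return [
        (word, lemma)
        for word, lemma in pairs
        if len(word) > 1 and lemma == word[0].lower()
    ]


flagged = truncated_lemmas(pairs)
print(f"{len(flagged)} of {len(pairs)} tokens have a truncated lemma")
for word, lemma in flagged:
    print(f"{word} -> {lemma}")
```

With live output, the pairs could presumably be built from each entry's `word` and `lemma` attributes, assuming the entries expose them as in the README example. Punctuation is skipped because a one-character lemma is correct there.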

Reproduction Steps

from cube.api import Cube  # import the Cube object
cube = Cube(verbose=True)  # initialize it
cube.load("kk")  # select the desired language (it will auto-download the model on first run)
text = "Алтай жерінің асты қандай қазыналы болса. Ағаш."
sentences = cube(text)  # call with your own text (string) to obtain the annotations
sentences

System Information

tiberiu44 commented 3 years ago

Closing as duplicate