Closed alisafaya closed 7 years ago
The language identification mechanism is designed for detecting the language of sentences, paragraphs, or documents, not individual words. Usually you need more than 20 characters of text to get good results. If you want to eliminate non-Turkish words, it is probably better to use the morphological analysis mechanism. However, in that case some proper nouns that do not exist in the Zemberek root dictionary will also be eliminated.
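A minimal sketch of that filtering approach. Here `canAnalyze` is an invented stand-in for a real morphological analyzer (e.g. Zemberek's `TurkishMorphology.analyze(word)`, checking whether any analysis was produced); the tiny vocabulary below exists only so the snippet is self-contained and runnable:

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

public class TurkishWordFilter {

    // Stand-in for a real analyzer such as Zemberek's
    // TurkishMorphology.analyze(word).analysisCount() > 0.
    // This hand-made set of "analyzable" words is purely illustrative.
    static final Set<String> FAKE_ANALYZABLE = Set.of("kitap", "ev", "geldi");

    static boolean canAnalyze(String word) {
        // Use the Turkish locale so lowercasing handles dotted/dotless i correctly.
        return FAKE_ANALYZABLE.contains(word.toLowerCase(new Locale("tr")));
    }

    // Keep only tokens the analyzer can parse; everything else is
    // treated as non-Turkish (or at least unknown) and dropped.
    public static List<String> keepAnalyzable(List<String> tokens) {
        return tokens.stream()
                .filter(TurkishWordFilter::canAnalyze)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("kitap", "hello", "ev", "xyz");
        System.out.println(keepAnalyzable(tokens)); // prints [kitap, ev]
    }
}
```

Note the caveat from the thread applies directly: any proper noun missing from the analyzer's root dictionary would fail this check and be dropped along with genuinely foreign words.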
Also, if you explain what you want to achieve by eliminating non-Turkish words, we may have suggestions.
Of course. I want to train a word2vec model using the Turkish version of Wikipedia.
Then I suggest not eliminating non-Turkish words. Foreign words that occur in Wikipedia will not have much of an adverse effect. If the problem is sparsity (e.g. too many word types), then apply lemmatization. Also, fastText may be more suitable for languages like Turkish. fastText already provides pre-trained vectors for Turkish Wikipedia.
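The lemmatization idea (collapsing surface forms to reduce vocabulary size before training) can be sketched as follows. The `LEMMAS` table is an invented stand-in for real analyzer output; in a real pipeline the lemmas would come from a morphological analyzer and disambiguator such as Zemberek's, not a hard-coded map:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class Lemmatize {

    // Invented surface-form -> lemma table, for illustration only.
    // A real pipeline would obtain lemmas from a morphological analyzer.
    static final Map<String, String> LEMMAS = Map.of(
            "kitaplar", "kitap",
            "kitaplarda", "kitap",
            "evde", "ev",
            "evler", "ev"
    );

    // Replace each token with its lemma; tokens without a known
    // lemma (e.g. proper nouns) are kept unchanged.
    public static List<String> lemmatize(List<String> tokens) {
        return tokens.stream()
                .map(t -> LEMMAS.getOrDefault(t, t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> corpus = List.of("kitaplar", "evde", "kitaplarda", "merhaba");
        System.out.println(lemmatize(corpus)); // prints [kitap, ev, kitap, merhaba]
        // Four surface forms collapse to three types: the vocabulary shrinks,
        // which is the point when sparsity is the problem.
    }
}
```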
https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
Lastly, I am working on porting fastText to Zemberek; hopefully in the next release you will be able to use it from Java directly.
Thanks for your interest.
LanguageIdentifier langId = LanguageIdentifier.fromInternalModelGroup("tr_group");
langId.containsLanguage("ahmet", "tr", 10);
langId.containsLanguage("kitap", "tr", 10);
Both of these calls returned false.
How can I distinguish Turkish words from words of other languages? I am doing morphological analysis over the Turkish version of Wikipedia, and I want to get rid of non-Turkish words.