Closed alisafaya closed 7 years ago
The language identification mechanism is designed for detecting the language of sentences, paragraphs, or documents, not individual words. Usually you need more than 20 characters of text to get good results. If you want to eliminate non-Turkish words, it is probably better to use the morphological analysis mechanism. However, in that case some proper nouns that do not exist in the Zemberek root dictionary will also be eliminated.
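A minimal sketch of that filtering approach. Here `canAnalyze` is an invented stand-in for a real morphological analyzer (e.g. Zemberek's `TurkishMorphology.analyze(word)`, checking whether any analysis was produced); the tiny vocabulary below exists only so the snippet is self-contained and runnable:

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

public class TurkishWordFilter {

    // Stand-in for a real analyzer such as Zemberek's
    // TurkishMorphology.analyze(word).analysisCount() > 0.
    // This hand-made set of "analyzable" words is purely illustrative.
    static final Set<String> FAKE_ANALYZABLE = Set.of("kitap", "ev", "geldi");

    static boolean canAnalyze(String word) {
        // Use the Turkish locale so lowercasing handles dotted/dotless i correctly.
        return FAKE_ANALYZABLE.contains(word.toLowerCase(new Locale("tr")));
    }

    // Keep only tokens the analyzer can parse; everything else is
    // treated as non-Turkish (or at least unknown) and dropped.
    public static List<String> keepAnalyzable(List<String> tokens) {
        return tokens.stream()
                .filter(TurkishWordFilter::canAnalyze)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("kitap", "hello", "ev", "xyz");
        System.out.println(keepAnalyzable(tokens)); // prints [kitap, ev]
    }
}
```

Note the caveat from the thread applies directly: any proper noun missing from the analyzer's root dictionary would fail this check and be dropped along with genuinely foreign words.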
Also, if you explain what you want to achieve by eliminating non-Turkish words, we may have suggestions.
Of course. I want to train a word2vec model using the Turkish version of Wikipedia.
Then I suggest not eliminating non-Turkish words. Foreign words that occur in Wikipedia will not have much of an adverse effect. If the problem is sparsity (e.g. too many word types), then apply lemmatization. Also, fastText may be more suitable for languages like Turkish. fastText already provides pre-trained vectors for Turkish Wikipedia.
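The lemmatization idea (collapsing surface forms to reduce vocabulary size before training) can be sketched as follows. The `LEMMAS` table is an invented stand-in for real analyzer output; in a real pipeline the lemmas would come from a morphological analyzer and disambiguator such as Zemberek's, not a hard-coded map:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class Lemmatize {

    // Invented surface-form -> lemma table, for illustration only.
    // A real pipeline would obtain lemmas from a morphological analyzer.
    static final Map<String, String> LEMMAS = Map.of(
            "kitaplar", "kitap",
            "kitaplarda", "kitap",
            "evde", "ev",
            "evler", "ev"
    );

    // Replace each token with its lemma; tokens without a known
    // lemma (e.g. proper nouns) are kept unchanged.
    public static List<String> lemmatize(List<String> tokens) {
        return tokens.stream()
                .map(t -> LEMMAS.getOrDefault(t, t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> corpus = List.of("kitaplar", "evde", "kitaplarda", "merhaba");
        System.out.println(lemmatize(corpus)); // prints [kitap, ev, kitap, merhaba]
        // Four surface forms collapse to three types: the vocabulary shrinks,
        // which is the point when sparsity is the problem.
    }
}
```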
https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
Lastly, I am working on porting fastText to Zemberek; hopefully in the next release you will be able to use it from Java directly.
Thanks for your interest.
LanguageIdentifier langId = LanguageIdentifier.fromInternalModelGroup("tr_group");
langId.containsLanguage("ahmet", "tr", 10);
langId.containsLanguage("kitap", "tr", 10);
Both of these calls returned false.
How can I distinguish Turkish words from words of other languages? I am doing morphological analysis over the Turkish version of Wikipedia, and I want to get rid of non-Turkish words.