Kyubyong / wordvectors

Pre-trained word vectors of 30+ languages
MIT License

This project has two purposes. First, I'd like to share some of my experience with NLP tasks such as segmentation and word vectors. Second, and more importantly, some people are probably searching for pre-trained word vector models for non-English languages. Alas! English has received far more attention than any other language. Check this to see how easily you can get a variety of pre-trained English word vectors with little effort. I think it's time to turn our eyes to a multilingual version of this.

Nearing the end of the work, I learned that a similar project named polyglot already exists. I strongly encourage you to check out this great project. How embarrassing! Nevertheless, I decided to release this project. You will find that my work has its own flavor, after all.

Requirements

Background / References

Work Flow

Pre-trained models

Two types of pre-trained models are provided: (w) denotes word2vec and (f) denotes fastText.

| Language | ISO 639-1 | Vector Size | Corpus Size | Vocabulary Size | Training Method |
|---|---|---|---|---|---|
| Bengali (w) / Bengali (f) | bn | 300 | 147M | 10059 | negative sampling |
| Catalan (w) / Catalan (f) | ca | 300 | 967M | 50013 | negative sampling |
| Chinese (w) / Chinese (f) | zh | 300 | 1G | 50101 | negative sampling |
| Danish (w) / Danish (f) | da | 300 | 295M | 30134 | negative sampling |
| Dutch (w) / Dutch (f) | nl | 300 | 1G | 50160 | negative sampling |
| Esperanto (w) / Esperanto (f) | eo | 300 | 1G | 50597 | negative sampling |
| Finnish (w) / Finnish (f) | fi | 300 | 467M | 30029 | negative sampling |
| French (w) / French (f) | fr | 300 | 1G | 50130 | negative sampling |
| German (w) / German (f) | de | 300 | 1G | 50006 | negative sampling |
| Hindi (w) / Hindi (f) | hi | 300 | 323M | 30393 | negative sampling |
| Hungarian (w) / Hungarian (f) | hu | 300 | 692M | 40122 | negative sampling |
| Indonesian (w) / Indonesian (f) | id | 300 | 402M | 30048 | negative sampling |
| Italian (w) / Italian (f) | it | 300 | 1G | 50031 | negative sampling |
| Japanese (w) / Japanese (f) | ja | 300 | 1G | 50108 | negative sampling |
| Javanese (w) / Javanese (f) | jv | 100 | 31M | 10019 | negative sampling |
| Korean (w) / Korean (f) | ko | 200 | 339M | 30185 | negative sampling |
| Malay (w) / Malay (f) | ms | 100 | 173M | 10010 | negative sampling |
| Norwegian (w) / Norwegian (f) | no | 300 | 1G | 50209 | negative sampling |
| Norwegian Nynorsk (w) / Norwegian Nynorsk (f) | nn | 100 | 114M | 10036 | negative sampling |
| Polish (w) / Polish (f) | pl | 300 | 1G | 50035 | negative sampling |
| Portuguese (w) / Portuguese (f) | pt | 300 | 1G | 50246 | negative sampling |
| Russian (w) / Russian (f) | ru | 300 | 1G | 50102 | negative sampling |
| Spanish (w) / Spanish (f) | es | 300 | 1G | 50003 | negative sampling |
| Swahili (w) / Swahili (f) | sw | 100 | 24M | 10222 | negative sampling |
| Swedish (w) / Swedish (f) | sv | 300 | 1G | 50052 | negative sampling |
| Tagalog (w) / Tagalog (f) | tl | 100 | 38M | 10068 | negative sampling |
| Thai (w) / Thai (f) | th | 300 | 696M | 30225 | negative sampling |
| Turkish (w) / Turkish (f) | tr | 200 | 370M | 30036 | negative sampling |
| Vietnamese (w) / Vietnamese (f) | vi | 100 | 74M | 10087 | negative sampling |
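
As a rough illustration of how vectors like these might be used once downloaded, here is a minimal sketch that reads vectors in the plain-text layout commonly written by word2vec and fastText tools (a `count dim` header line, then one word and its values per line) and compares words by cosine similarity. The layout of the actual downloads is an assumption here; if a file is in a binary or library-specific format, it needs that library's loader instead. The names `load_text_vectors`, `cosine`, and the toy `SAMPLE` data are hypothetical and stand in for a real file.

```python
import io
import math


def load_text_vectors(stream):
    """Parse text-format vectors: a 'count dim' header, then 'word v1 v2 ...' lines.

    Assumes the plain-text layout; this is not guaranteed for every download.
    """
    header = stream.readline().split()
    dim = int(header[1])  # header[0] is the declared word count
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy 3-dimensional data standing in for a real vector file (hypothetical values).
SAMPLE = "3 3\nking 0.9 0.1 0.0\nqueen 0.8 0.2 0.0\napple 0.0 0.1 0.9\n"

dim, vectors = load_text_vectors(io.StringIO(SAMPLE))
print(dim, sorted(vectors))
print(cosine(vectors["king"], vectors["queen"]) > cosine(vectors["king"], vectors["apple"]))
```

With a real file, the `io.StringIO(SAMPLE)` argument would be replaced by an open file handle, for example `open(path, encoding="utf-8")`, where `path` points at the extracted text file.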