Closed schwittlick closed 7 years ago
maybe download the Wikipedia data & train on the combined text: textminingonline.com/training-word2vec-model-on-english-wikipedia-by-gensim
Google has released its test set of about 20,000 syntactic and semantic test examples, following the "A is to B as C is to D" task: http://word2vec.googlecode.com/svn/trunk/questions-words.txt.
Gensim supports the same evaluation set, in exactly the same format:
model.accuracy('/tmp/questions-words.txt')
2014-02-01 22:14:28,387 : INFO : family: 88.9% (304/342)
2014-02-01 22:29:24,006 : INFO : gram1-adjective-to-adverb: 32.4% (263/812)
2014-02-01 22:36:26,528 : INFO : gram2-opposite: 50.3% (191/380)
2014-02-01 23:00:52,406 : INFO : gram3-comparative: 91.7% (1222/1332)
2014-02-01 23:13:48,243 : INFO : gram4-superlative: 87.9% (617/702)
2014-02-01 23:29:52,268 : INFO : gram5-present-participle: 79.4% (691/870)
2014-02-01 23:57:04,965 : INFO : gram7-past-tense: 67.1% (995/1482)
2014-02-02 00:15:18,525 : INFO : gram8-plural: 89.6% (889/992)
2014-02-02 00:28:18,140 : INFO : gram9-plural-verbs: 68.7% (482/702)
2014-02-02 00:28:18,140 : INFO : total: 74.3% (5654/7614)
I downloaded the wiki dataset 2 days ago. Check the data folder on the drive. There is also an older dump there.
I trained on the Wikipedia data (took 6h!) but couldn't load the model. I read online that some people had the same issue and resolved it by saving the model in the word2vec binary format. Doing that now: https://github.com/mrzl/ECO/commit/756554fd359abb0db2a60a94786c8c2e1ac42071
Finally it worked -.- I documented some (not very expressive) stats here: https://github.com/mrzl/ECO/wiki/Word2Vec-Training
These were generated via https://github.com/mrzl/ECO/blob/master/src/python/nlp/inspect_word2vec_model.py
The final model with our combined corpora is here:
/mnt/drive/data/eco/word2vec_models/wiki_plus_v3_valid_combined.txt_numpy.w2vmodel
Seems like it's in gensim's develop branch as of Oct 2: https://github.com/RaRe-Technologies/gensim/pull/900
Let's see if this functionality is in the release version that's installed via pip as well..
A blog post about this has been kept up to date here: http://rutumulkar.com/blog/2015/word2vec