Schwittleymani / ECO

Electronic Chaos Oracle
https://schwittlick.net/eco
Apache License 2.0

Combine Google/Wiki model with our own model #176

Closed schwittlick closed 7 years ago

schwittlick commented 7 years ago

seems like it's in gensim's develop branch as of Oct 2: https://github.com/RaRe-Technologies/gensim/pull/900

let's see if this functionality is in the release version that's installed via pip as well..

a blog post about this has been kept up to date here: http://rutumulkar.com/blog/2015/word2vec

schwittlick commented 7 years ago

maybe download wikipedia data & train on combined text: textminingonline.com/training-word2vec-model-on-english-wikipedia-by-gensim

schwittlick commented 7 years ago

Google have released their testing set of about 20,000 syntactic and semantic test examples, following the “A is to B as C is to D” task: http://word2vec.googlecode.com/svn/trunk/questions-words.txt.

Gensim supports the same evaluation set, in exactly the same format:

model.accuracy('/tmp/questions-words.txt')

2014-02-01 22:14:28,387 : INFO : family: 88.9% (304/342)
2014-02-01 22:29:24,006 : INFO : gram1-adjective-to-adverb: 32.4% (263/812)
2014-02-01 22:36:26,528 : INFO : gram2-opposite: 50.3% (191/380)
2014-02-01 23:00:52,406 : INFO : gram3-comparative: 91.7% (1222/1332)
2014-02-01 23:13:48,243 : INFO : gram4-superlative: 87.9% (617/702)
2014-02-01 23:29:52,268 : INFO : gram5-present-participle: 79.4% (691/870)
2014-02-01 23:57:04,965 : INFO : gram7-past-tense: 67.1% (995/1482)
2014-02-02 00:15:18,525 : INFO : gram8-plural: 89.6% (889/992)
2014-02-02 00:28:18,140 : INFO : gram9-plural-verbs: 68.7% (482/702)
2014-02-02 00:28:18,140 : INFO : total: 74.3% (5654/7614)

transfluxus commented 7 years ago

I downloaded the wiki dataset 2 days ago. Check the data folder on the drive. There is also an older dump

schwittlick commented 7 years ago

i trained on the wikipedia data (took 6h!) but couldn't load the model. i read online that some people had the same issue and resolved it by saving the model in the word2vec binary format. doing that now: https://github.com/mrzl/ECO/commit/756554fd359abb0db2a60a94786c8c2e1ac42071

schwittlick commented 7 years ago

finally it worked -.- documented some (not very expressive) stats here: https://github.com/mrzl/ECO/wiki/Word2Vec-Training

these were generated via https://github.com/mrzl/ECO/blob/master/src/python/nlp/inspect_word2vec_model.py

the final model with our combined corpora is here:

/mnt/drive/data/eco/word2vec_models/wiki_plus_v3_valid_combined.txt_numpy.w2vmodel