Train your own word2vec embeddings using the wiki-english dataset
You may want pretrained word2vec vectors, and this repository may be just what you need. What makes it tricky is that there are no pretrained vectors built from the wiki-english dataset. What makes it even trickier is that the given usage code, though it works for the text8 dataset, cannot train vectors on the wiki-english-20171001 dataset.
We have tested this several times, and the most probable reason is that the structure of wiki-english-20171001 differs slightly from the other corpora: each record is a full article split into many sections, rather than a single tokenized sentence.
To get it to work, we adapt the IterableWrapper provided by this post and apply it to the wiki-english dataset.
To see how fast training is progressing, configure your logging like this:
import logging
logging.basicConfig(format='%(asctime)s %(levelname)s:%(message)s', level=logging.DEBUG, datefmt='%I:%M:%S')
Then we load the dataset as introduced in this repository:
import gensim.downloader as api
from gensim.models import Word2Vec
wiki = api.load("wiki-english-20171001")
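Peeking at the first article shows the section-based structure. The field names below follow gensim-data's description of this corpus (only 'section_texts' is used later; double-check the others on your copy):
first = next(iter(wiki))
print(list(first.keys()))               # expected: ['title', 'section_titles', 'section_texts']
print(first['title'])                   # the article title
print(first['section_texts'][0][:80])   # raw, untokenized text of the first section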
The key idea is HERE: wrap the dataset so that a gensim model can consume it as a stream of tokenized sentences.
def f(a):
    # Flatten an article's sections into one list of lowercase word tokens,
    # keeping only alphabetic characters and spaces (punctuation and digits are dropped)
    return ''.join(x.lower() for s in a['section_texts'] for x in s + ' ' if x.isalpha() or x == ' ').strip().split()
class IterableWrapper:
    def __init__(self, iterable):
        self.iterable = iterable

    def __iter__(self):
        # Return a fresh generator on every call, so the corpus can be
        # streamed several times (Word2Vec makes multiple passes over it)
        for article in self.iterable:
            yield f(article)
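As a quick sanity check, here is what f produces for a toy article dict shaped like the wiki records (the dict below is made up purely for illustration):
sample = {'section_texts': ["Hello, World!", "It's 2017."]}
print(f(sample))  # ['hello', 'world', 'its'] -- lowercased, punctuation and digits dropped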
Finally, you train your vectors as usual and check how good they are:
dataset = IterableWrapper(wiki)
model = Word2Vec(dataset, vector_size=128, window=15, epochs=50, sg=1, min_count=150)
model.save("word2vec-wiki.model")
model.wv.save_word2vec_format('word2vec-wiki.txt')
print(model.wv.most_similar('cat'))
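Once saved, you can reload these artifacts later without retraining; both calls below are standard gensim API:
from gensim.models import Word2Vec, KeyedVectors
model = Word2Vec.load("word2vec-wiki.model")                 # full model, training can be resumed
wv = KeyedVectors.load_word2vec_format("word2vec-wiki.txt")  # plain vectors in word2vec text format
print(wv.most_similar('cat', topn=5))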
If you feel like training a Phrases model on this wiki-english dataset, the usage looks like this:
from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS
dataset = IterableWrapper(wiki)
phrases = Phrases(dataset, threshold=1, min_count=1, connector_words=ENGLISH_CONNECTOR_WORDS)
# Export a FrozenPhrases object that is more efficient but doesn't allow further training.
frozen_phrases = phrases.freeze()
frozen_phrases.save('frozen_phrases')
# test it on a tokenized sentence (phrase models expect a list of tokens, not a raw string)
print(frozen_phrases['for example this is a sentence for you to test it out'.split()])
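If you then want phrase-merged sentences as Word2Vec input, one sketch (reusing the wrapper from above) is to let the frozen model transform the corpus lazily; applying a gensim phrase model to an iterable of sentences returns a transformed corpus you can pass straight to Word2Vec:
from gensim.models.phrases import FrozenPhrases
frozen_phrases = FrozenPhrases.load('frozen_phrases')  # reload the saved model
dataset = IterableWrapper(wiki)
# frozen_phrases[dataset] lazily rewrites each sentence, joining detected
# bigrams such as 'new' and 'york' into 'new_york' before training starts
model = Word2Vec(frozen_phrases[dataset], vector_size=128, window=15, epochs=50, sg=1, min_count=150)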
See here for more details on training phrases.