
word2vec-gensim-wiki-english

Train your own word2vec embeddings using a wiki english dataset

You may want pretrained word2vec vectors, and this repository may be just what you need. What makes it tricky is that there are no pretrained vectors for the wiki-english dataset. What makes it even trickier is that the usual usage code, though it works for the text8 dataset, cannot train vectors on the wiki-english-20171001 dataset.

We have tested this several times, and the most probable reason is that the data structure of wiki-english-20171001 differs slightly from other gensim datasets: each article is a dict containing multiple raw-text sections, rather than a plain list of tokenized sentences.

To get it to work, we adapt the IterableWrapper suggested in this post and apply it to the wiki-english dataset.

Usage

To see how fast training progresses, you'd better configure your logging like this

import logging
logging.basicConfig(format='%(asctime)s %(levelname)s:%(message)s', level=logging.DEBUG, datefmt='%I:%M:%S')

Then we load the dataset as introduced in this repository

import gensim.downloader as api
from gensim.models import Word2Vec

wiki = api.load("wiki-english-20171001")
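
Before wrapping it, you can peek at one streamed article to see the structure that breaks the stock training code. A minimal sketch (the 'title', 'section_titles' and 'section_texts' keys follow the gensim-data description of this dataset):

article = next(iter(wiki))
print(article.keys())                    # expected: 'title', 'section_titles', 'section_texts'
print(article['section_titles'][:3])     # section headings of the first article
print(article['section_texts'][0][:80])  # raw, untokenized text of its first section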

The key idea is HERE! You wrap the dataset in a way that a model from gensim can handle

def f(a):
    # Flatten all section texts of an article into one lowercase string,
    # keep only alphabetic characters and spaces, then split into tokens
    return ''.join(x.lower() for s in a['section_texts'] for x in s + ' ' if x.isalpha() or x == ' ').strip().split()

class IterableWrapper:
    # A restartable iterable: every call to __iter__ starts a fresh pass over
    # the dataset, which gensim needs to run multiple training epochs, and
    # which also allows several independent iterators at the same time
    def __init__(self, iterable):
        self.iterable = iterable
    def __iter__(self):
        for article in self.iterable:
            yield f(article)
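
A quick sanity check on a made-up article dict (hypothetical input, just to show what f produces):

toy = {'section_texts': ['Hello, World!', 'Word2vec is fun.']}
print(f(toy))  # ['hello', 'world', 'wordvec', 'is', 'fun'] -- digits like the 2 in 'Word2vec' are dropped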

Finally, you train your vectors as usual and test how good they are

dataset = IterableWrapper(wiki)
# gensim >= 4 renamed size -> vector_size and iter -> epochs
model = Word2Vec(dataset, vector_size=128, window=15, epochs=50, sg=1, min_count=150)
model.save("word2vec-wiki.model")
model.wv.save_word2vec_format('word2vec-wiki.txt')
print(model.wv.most_similar('cat'))
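
Later you can reload either file without retraining (a minimal sketch using gensim's standard loading APIs):

from gensim.models import Word2Vec, KeyedVectors

model = Word2Vec.load('word2vec-wiki.model')                 # full model, training can continue
wv = KeyedVectors.load_word2vec_format('word2vec-wiki.txt')  # lightweight, vectors only
print(wv.most_similar('cat', topn=5))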

Some other usage

If you feel like training a Phrases model on this wiki-english dataset, the usage looks like this

from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS

dataset = IterableWrapper(wiki)
# connector_words lets phrases span common function words, e.g. 'bank_of_america'
phrases = Phrases(dataset, threshold=1, min_count=1, connector_words=ENGLISH_CONNECTOR_WORDS)

# Export a FrozenPhrases object that is more efficient but doesn't allow further training.
frozen_phrases = phrases.freeze()
frozen_phrases.save('frozen_phrases')

# test it (the input must be a list of tokens, not a raw string)
print(frozen_phrases['for example this is a sentence for you to test it out'.split()])
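
If you then want vectors for the detected phrases too, one option (a sketch in the spirit of the IterableWrapper above, not part of the original code) is to apply the frozen model to each tokenized article before training:

class PhrasedWrapper:
    # Like IterableWrapper, but joins detected bigrams (e.g. 'new_york') in each article
    def __init__(self, iterable, phraser):
        self.iterable = iterable
        self.phraser = phraser
    def __iter__(self):
        for tokens in self.iterable:
            yield self.phraser[tokens]

phrased = PhrasedWrapper(IterableWrapper(wiki), frozen_phrases)
phrase_model = Word2Vec(phrased, vector_size=128, window=15, epochs=50, sg=1, min_count=150)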

See here for more details on training phrases.