
SentenceRepresentation

This code accompanies the paper 'Learning Distributed Representations of Sentences from Unlabelled Data' by Felix Hill, KyungHyun Cho and Anna Korhonen (2016).

To train a FastSent model

Move to the FastSent directory. The code is based on a small change to the gensim code. You can find out more about gensim at https://radimrehurek.com/gensim/; it is really good, and the gensim contributors deserve 99% of the credit for this implementation.

To train a FastSent model, just run ./train_model.sh. The script checks out a particular version of gensim (the version on top of which we made the change) and copies in our modifications.
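Once training has finished, you could query the model along the following lines. This is a sketch only, not taken from the repository: it assumes the FastSent class follows gensim's Word2Vec save/load and word-lookup conventions (the implementation is a modified gensim), and the module name and model path are assumptions. Per the paper, a FastSent sentence representation is the sum of the embeddings of the words in the sentence.

```python
# Hypothetical usage sketch; module name, class API and path are assumptions.
import numpy
from fastsent import FastSent

model = FastSent.load('fastsent.model')  # path written by train_model.sh (assumed)

# FastSent sentence representation: the sum of the word embeddings (per the paper).
sentence = 'the cat sat on the mat'.split()
vector = numpy.sum([model[w] for w in sentence], axis=0)
```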

For things to work you will need to check the following:

The corpus

This must be a plaintext file with each sentence on a new line. It's no problem to have full stops at the end of each sentence. Apply any pre-processing to this file before training. We used the Toronto Books Corpus (http://www.cs.toronto.edu/~mbweb/) as-is (the only pre-processing was lower-casing).
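As an illustration, a minimal pre-processing script might look like the following. It assumes the raw corpus already has one sentence per line and only lower-cases it (the single pre-processing step we applied); the file names are placeholders.

```python
# Lower-case a one-sentence-per-line corpus; file names are placeholders.
with open('corpus_raw.txt') as src, open('corpus.txt', 'w') as out:
    for line in src:
        out.write(line.lower())
```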

To train an SDAE

Move to the SDAE directory.

Pre-trained embeddings

To train a model with pre-trained word embeddings (which are mapped into the RNN via a learned mapping, but not updated during training), you need to put your word embeddings in the following form:

{word: numpy.array(embedding, dtype='float32') for word in your_vocabulary}

This object (a Python dictionary mapping each word to its embedding vector) needs to be saved as a pickle file (using cPickle). Then, in train_book.py, set use_preemb to True and set embeddings to the path of this file.
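As a concrete sketch (assuming Python 2, since the code reads the file with cPickle): build the dictionary and pickle it. The vocabulary, dimensionality, vectors and file name below are placeholders; in practice the vectors would come from your pre-trained embeddings.

```python
import cPickle
import numpy

vocab = ['the', 'cat', 'sat']  # placeholder vocabulary
dim = 300                      # placeholder embedding dimensionality

# Placeholder random vectors; substitute your real pre-trained embeddings.
embeddings = {word: numpy.random.rand(dim).astype('float32') for word in vocab}

with open('embeddings.pkl', 'wb') as f:
    cPickle.dump(embeddings, f, protocol=cPickle.HIGHEST_PROTOCOL)
```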