alex-tifrea / poincare_glove

Implementation of the "Poincare Glove: Hyperbolic word embeddings" paper
GNU Lesser General Public License v2.1

Code for our ICLR'19 submission on Poincare GloVe. This repo is a fork of the gensim repository.

Installation

To set up the environment for the Poincare GloVe code, follow the steps below.

This version has been tested under Python 3.6.
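Since the repository is a fork of gensim, a typical setup is sketched below. The commands are only an illustration under the assumption that the Cython extensions build with the standard setuptools workflow; consult the repository itself for the authoritative list of dependencies.

git clone https://github.com/alex-tifrea/poincare_glove.git
cd poincare_glove
pip install numpy scipy cython
python setup.py build_ext --inplace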

Documentation

For training and evaluating a model, we use the run_{word2vec, glove}.sh scripts. Their usage is similar, so the following focuses only on GloVe.

Training: To train a Vanilla (Euclidean) GloVe model, run the following:

./run_glove.sh --train --root path/to/root --coocc_file path/to/coocc/file --vocab_file path/to/vocab/file --epochs 50 --workers 20 --restrict_vocab 200000 --chunksize 1000 --lr 0.05 --bias --size 100

The root should be the folder that contains the repository folder as well (the folder in which the git clone command was run). The coocc_file is a binary file that contains co-occurrence triples in the format generated by the GloVe preprocessing scripts (https://github.com/stanfordnlp/GloVe/blob/master/src/cooccur.c). The vocab_file is a text file that contains the vocabulary and should have a format similar to the one generated by https://github.com/stanfordnlp/GloVe/blob/master/src/vocab_count.c.
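If you need to produce these files yourself, the Stanford GloVe tools can generate both. Assuming the GloVe repository has been cloned and built with make, and that corpus.txt is a whitespace-tokenized plain-text corpus, a typical invocation looks like the following (the minimum count, memory budget and window size are illustrative values, not the settings used in the paper):

build/vocab_count -min-count 5 -verbose 2 < corpus.txt > vocab.txt
build/cooccur -memory 8.0 -vocab-file vocab.txt -verbose 2 -window-size 10 < corpus.txt > cooccurrence.bin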

For Poincare embeddings, use a command similar to the following:

./run_glove.sh --train --root path/to/root --coocc_file path/to/coocc/file --vocab_file path/to/vocab/file --epochs 50 --workers 20 --restrict_vocab 200000 --lr 0.01 --poincare 1 --bias --size 100 --dist_func cosh-dist-sq

The dist_func parameter specifies which function h to use (h is the notation used in the paper). To see what each of the possible dist_func options does, consult the code in glove_inner.pyx (functions mix_poincare_similarity and poincare_similarity).
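As a quick reference, the distance between two points x and y of the Poincare ball is

d(x, y) = arccosh(1 + 2 * ||x - y||^2 / ((1 - ||x||^2) * (1 - ||y||^2)))

and the option name cosh-dist-sq suggests taking h(d) = cosh(d)^2, i.e. the squared hyperbolic cosine of this distance. This reading is only an interpretation of the option name and of the notation in the paper; the functions in glove_inner.pyx remain the authoritative definitions.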

For a Cartesian product of Poincare balls, the command changes a little bit:

./run_glove.sh --train --root path/to/root --coocc_file path/to/coocc/file --vocab_file path/to/vocab/file --epochs 50 --workers 20 --restrict_vocab 200000 --lr 0.05 --poincare 1 --bias --size 100 --mix --num_embs 50 --dist_func cosh-dist-sq

Here, num_embs specifies into how many low-dimensional embeddings the full vector of length size is split.
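For example, with --size 100 and --num_embs 50 the vector is split into 50 two-dimensional factors, each living in its own Poincare ball. Under the usual product-manifold construction, the distance between two product embeddings x = (x_1, ..., x_50) and y = (y_1, ..., y_50) combines the per-ball distances as

d(x, y) = sqrt( d(x_1, y_1)^2 + ... + d(x_50, y_50)^2 )

The aggregation actually used by the code may differ, so consult mix_poincare_similarity in glove_inner.pyx for the definitive formula.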

A number of additional training options are available.

Evaluating: Evaluation for all model types works in a similar way. The command format is the following:

./run_glove.sh --eval --restrict_vocab 200000 --root path/to/root --model_file path/to/saved/model

path/to/saved/model should point to the location of the model that is being evaluated.
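For instance, reusing the flags shown above with a hypothetical workspace layout (both paths below are placeholders, not files shipped with the repository):

./run_glove.sh --eval --restrict_vocab 200000 --root ~/poincare_ws --model_file ~/poincare_ws/models/poincare_glove_100d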

A number of additional options are available for evaluation.

Logs: During training/evaluation, some important information is saved in logs that can later be used to generate plots or to debug and investigate the characteristics of the trained embedding models.

During training, progress is saved in the folder ROOT/logs, including information about the epoch loss, the vector norms, and the scores on some analogy and similarity benchmarks.

In ROOT/eval_logs you can find a persistent copy of the output generated by the evaluation of a model.
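Assuming the files written to ROOT/logs are plain text (their exact names depend on the run configuration), training progress can be followed live with a standard command such as:

tail -f path/to/root/logs/*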

Pre-trained embeddings

Some of the embedding models presented in the paper are made available here. The embeddings have been trained on a dump of the English Wikipedia containing 1.4 billion tokens. For more details about the training setup, please consult the experiments section of our paper.

References

If you find this code useful for your research, please cite the following paper:

@inproceedings{
  tifrea2018poincare,
  title={Poincare Glove: Hyperbolic Word Embeddings},
  author={Alexandru Tifrea and Gary Becigneul and Octavian-Eugen Ganea},
  booktitle={International Conference on Learning Representations},
  year={2019},
  url={https://openreview.net/forum?id=Ske5r3AqK7},
}