PrincetonML / SIF

Sentence embedding by the Smooth Inverse Frequency (SIF) weighting scheme
MIT License

SIF

This is the code for the paper "A Simple but Tough-to-Beat Baseline for Sentence Embeddings".
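
As a rough orientation, the SIF scheme described in the paper can be sketched in a few lines of NumPy: each word vector is weighted by a / (a + p(w)), the weighted vectors are averaged per sentence, and the projection onto the first singular vector of the resulting matrix is removed. The snippet below is a minimal sketch and not the code in this repository. The function names, the dict-based inputs `word_vectors` and `word_prob`, the default `a=1e-3`, and the handling of out-of-vocabulary tokens are all assumptions made for illustration.

```python
import numpy as np


def sif_weighted_average(sentences, word_vectors, word_prob, a=1e-3):
    """Weighted average of word vectors, one row per sentence.

    sentences:    list of token lists (hypothetical input format)
    word_vectors: dict mapping token -> 1-D np.ndarray (hypothetical)
    word_prob:    dict mapping token -> unigram probability p(w) (hypothetical)
    a:            smoothing parameter; the default is an assumption
    """
    dim = len(next(iter(word_vectors.values())))
    emb = np.zeros((len(sentences), dim))
    for i, sent in enumerate(sentences):
        # Assumption: tokens without a word vector are skipped.
        tokens = [w for w in sent if w in word_vectors]
        if not tokens:
            continue  # leave an all-zero row for empty sentences
        # SIF weight a / (a + p(w)); a missing probability is treated as 0.
        weights = np.array([a / (a + word_prob.get(w, 0.0)) for w in tokens])
        vecs = np.array([word_vectors[w] for w in tokens])
        emb[i] = weights @ vecs / len(tokens)
    return emb


def remove_first_component(emb):
    """Subtract each row's projection onto the first right singular vector."""
    _, _, vt = np.linalg.svd(emb, full_matrices=False)
    pc = vt[0]
    return emb - np.outer(emb @ pc, pc)


def sif_embedding(sentences, word_vectors, word_prob, a=1e-3):
    """Weighted average followed by removal of the first component."""
    return remove_first_component(
        sif_weighted_average(sentences, word_vectors, word_prob, a)
    )
```

Calling `sif_embedding(sentences, word_vectors, word_prob)` returns an array with one row per sentence. Rare words receive weights close to 1 and frequent words are down-weighted, which is the intuition behind the name "smooth inverse frequency".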

The code is written in Python and requires numpy, scipy, pickle, sklearn, theano, and lasagne. Some functions and classes are based on John Wieting's code for the paper "Towards Universal Paraphrastic Sentence Embeddings" (Thanks John!). The example data sets were also preprocessed using that code.

Install

To install all dependencies, using a virtualenv is suggested:

$ virtualenv .env
$ . .env/bin/activate
$ pip install -r requirements.txt 

Get started

To get started, cd into the directory examples/ and run demo.sh. It downloads the pretrained GloVe word embeddings, and then runs the scripts:

Check these files to see the options.

Source code

The code is separated into the following parts:

References

For technical details and full experimental results, see the paper.

@inproceedings{arora2017asimple,
    author = {Sanjeev Arora and Yingyu Liang and Tengyu Ma},
    title = {A Simple but Tough-to-Beat Baseline for Sentence Embeddings},
    booktitle = {International Conference on Learning Representations},
    year = {2017}
}