facebookresearch / SentEval

A python tool for evaluating the quality of sentence embeddings.
Other
2.09k stars 309 forks source link

SentEval: evaluation toolkit for sentence embeddings

SentEval is a library for evaluating the quality of sentence embeddings. We assess their generalization power by using them as features on a broad and diverse set of "transfer" tasks. SentEval currently includes 17 downstream tasks. We also include a suite of 10 probing tasks which evaluate what linguistic properties are encoded in sentence embeddings. Our goal is to ease the study and the development of general-purpose fixed-size sentence representations.

(04/22) SentEval new tasks: Added probing tasks for evaluating what linguistic properties are encoded in sentence embeddings

(10/04) SentEval example scripts for three sentence encoders: SkipThought-LN/GenSen/Google-USE

Dependencies

This code is written in python. The dependencies are:

Transfer tasks

Downstream tasks

SentEval allows you to evaluate your sentence embeddings as features for the following downstream tasks:

Task Type #train #test needs_train set_classifier
MR movie review 11k 11k 1 1
CR product review 4k 4k 1 1
SUBJ subjectivity status 10k 10k 1 1
MPQA opinion-polarity 11k 11k 1 1
SST binary sentiment analysis 67k 1.8k 1 1
SST fine-grained sentiment analysis 8.5k 2.2k 1 1
TREC question-type classification 6k 0.5k 1 1
SICK-E natural language inference 4.5k 4.9k 1 1
SNLI natural language inference 550k 9.8k 1 1
MRPC paraphrase detection 4.1k 1.7k 1 1
STS 2012 semantic textual similarity N/A 3.1k 0 0
STS 2013 semantic textual similarity N/A 1.5k 0 0
STS 2014 semantic textual similarity N/A 3.7k 0 0
STS 2015 semantic textual similarity N/A 8.5k 0 0
STS 2016 semantic textual similarity N/A 9.2k 0 0
STS B semantic textual similarity 5.7k 1.4k 1 0
SICK-R semantic textual similarity 4.5k 4.9k 1 0
COCO image-caption retrieval 567k 5*1k 1 0

where needs_train means a model with parameters is learned on top of the sentence embeddings, and set_classifier means you can define the parameters of the classifier in the case of a classification task (see below).

Note: COCO comes with ResNet-101 2048d image embeddings. More details on the tasks.

Probing tasks

SentEval also includes a series of probing tasks to evaluate what linguistic properties are encoded in your sentence embeddings:

Task Type #train #test needs_train set_classifier
SentLen Length prediction 100k 10k 1 1
WC Word Content analysis 100k 10k 1 1
TreeDepth Tree depth prediction 100k 10k 1 1
TopConst Top Constituents prediction 100k 10k 1 1
BShift Word order analysis 100k 10k 1 1
Tense Verb tense prediction 100k 10k 1 1
SubjNum Subject number prediction 100k 10k 1 1
ObjNum Object number prediction 100k 10k 1 1
SOMO Semantic odd man out 100k 10k 1 1
CoordInv Coordination Inversion 100k 10k 1 1

Download datasets

To get all the transfer tasks datasets, run (in data/downstream/):

./get_transfer_data.bash

This will automatically download and preprocess the downstream datasets, and store them in data/downstream (warning: for MacOS users, you may have to use p7zip instead of unzip). The probing tasks are already in data/probing by default.

How to use SentEval: examples

examples/bow.py

In examples/bow.py, we evaluate the quality of the average of word embeddings.

To download state-of-the-art fastText embeddings:

curl -Lo glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip
curl -Lo crawl-300d-2M.vec.zip https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip

To reproduce the results for bag-of-vectors, run (in examples/):

python bow.py

As required by SentEval, this script implements two functions: prepare (optional) and batcher (required) that turn text sentences into sentence embeddings. Then SentEval takes care of the evaluation on the transfer tasks using the embeddings as features.

examples/infersent.py

To get the InferSent model and reproduce our results, download our best models and run infersent.py (in examples/):

curl -Lo examples/infersent1.pkl https://dl.fbaipublicfiles.com/senteval/infersent/infersent1.pkl
curl -Lo examples/infersent2.pkl https://dl.fbaipublicfiles.com/senteval/infersent/infersent2.pkl

examples/skipthought.py - examples/gensen.py - examples/googleuse.py

We also provide example scripts for three other encoders:

Note that for SkipThought and GenSen, following the steps of the associated githubs is necessary. The Google encoder script should work as-is.

How to use SentEval

To evaluate your sentence embeddings, SentEval requires that you implement two functions:

  1. prepare (sees the whole dataset of each task and can thus construct the word vocabulary, the dictionary of word vectors etc)
  2. batcher (transforms a batch of text sentences into sentence embeddings)

1.) prepare(params, samples) (optional)

batcher only sees one batch at a time while the samples argument of prepare contains all the sentences of a task.

prepare(params, samples)

Example: in bow.py, prepare is is used to build the vocabulary of words and construct the "params.word_vect* dictionary of word vectors.

2.) batcher(params, batch)

batcher(params, batch)

Example: in bow.py, batcher is used to compute the mean of the word vectors for each sentence in the batch using params.word_vec. Use your own encoder in that function to encode sentences.

3.) evaluation on transfer tasks

After having implemented the batch and prepare function for your own sentence encoder,

1) to perform the actual evaluation, first import senteval and set its parameters:

import senteval
params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10}

2) (optional) set the parameters of the classifier (when applicable):

params['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 64,
                                 'tenacity': 5, 'epoch_size': 4}

You can choose nhid=0 (Logistic Regression) or nhid>0 (MLP) and define the parameters for training.

3) Create an instance of the class SE:

se = senteval.engine.SE(params, batcher, prepare)

4) define the set of transfer tasks and run the evaluation:

transfer_tasks = ['MR', 'SICKEntailment', 'STS14', 'STSBenchmark']
results = se.eval(transfer_tasks)

The current list of available tasks is:

['CR', 'MR', 'MPQA', 'SUBJ', 'SST2', 'SST5', 'TREC', 'MRPC', 'SNLI',
'SICKEntailment', 'SICKRelatedness', 'STSBenchmark', 'ImageCaptionRetrieval',
'STS12', 'STS13', 'STS14', 'STS15', 'STS16',
'Length', 'WordContent', 'Depth', 'TopConstituents','BigramShift', 'Tense',
'SubjNumber', 'ObjNumber', 'OddManOut', 'CoordinationInversion']

SentEval parameters

Global parameters of SentEval:

# senteval parameters
task_path                   # path to SentEval datasets (required)
seed                        # seed
usepytorch                  # use cuda-pytorch (else scikit-learn) where possible
kfold                       # k-fold validation for MR/CR/SUB/MPQA.

Parameters of the classifier:

nhid:                       # number of hidden units (0: Logistic Regression, >0: MLP); Default nonlinearity: Tanh
optim:                      # optimizer ("sgd,lr=0.1", "adam", "rmsprop" ..)
tenacity:                   # how many times dev acc does not increase before training stops
epoch_size:                 # each epoch corresponds to epoch_size pass on the train set
max_epoch:                  # max number of epoches
dropout:                    # dropout for MLP

Note that to get a proxy of the results while dramatically reducing computation time, we suggest the prototyping config:

params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 5}
params['classifier'] = {'nhid': 0, 'optim': 'rmsprop', 'batch_size': 128,
                                 'tenacity': 3, 'epoch_size': 2}

which will results in a 5 times speedup for classification tasks.

To produce results that are comparable to the literature, use the default config:

params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10}
params['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 64,
                                 'tenacity': 5, 'epoch_size': 4}

which takes longer but will produce better and comparable results.

For probing tasks, we used an MLP with a Sigmoid nonlinearity and and tuned the nhid (in [50, 100, 200]) and dropout (in [0.0, 0.1, 0.2]) on the dev set.

References

Please considering citing [1] if using this code for evaluating sentence embedding methods.

SentEval: An Evaluation Toolkit for Universal Sentence Representations

[1] A. Conneau, D. Kiela, SentEval: An Evaluation Toolkit for Universal Sentence Representations

@article{conneau2018senteval,
  title={SentEval: An Evaluation Toolkit for Universal Sentence Representations},
  author={Conneau, Alexis and Kiela, Douwe},
  journal={arXiv preprint arXiv:1803.05449},
  year={2018}
}

Contact: aconneau@fb.com, dkiela@fb.com

Related work