erikbern / ann-benchmarks

Benchmarks of approximate nearest neighbor libraries in Python
http://ann-benchmarks.com
MIT License

Consider adding a sentence embedding dataset #144

Open · jtibshirani opened this issue 4 years ago

jtibshirani commented 4 years ago

In addition to word embedding models like GloVe, there are now text embedding models like BERT and Universal Sentence Encoder that work at the level of sentences. These models take an entire sentence as input and output a single vector representation. Sentence embeddings have started to be used alongside kNN to power text search applications.

It would be great to have a dataset based on a sentence embedding model. A sentence vector dataset is likely to have different properties from the set of GloVe word vectors, and kNN algorithms may perform differently on the two.

As an example, a sentence vector dataset could be created by gathering a set of short questions from Stack Exchange, then running each question through an embedding model like Universal Sentence Encoder.
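
As a rough sketch of that step (the module URL below is the one TensorFlow Hub publishes for Universal Sentence Encoder, and the toy question list is a stand-in for real Stack Exchange data):

```python
# Minimal sketch: embed a handful of questions with Universal Sentence
# Encoder loaded from TensorFlow Hub. The toy question list is a
# stand-in for questions pulled from a Stack Exchange dump.
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

questions = [
    "How do I merge two dictionaries in Python?",
    "What is the difference between a process and a thread?",
]
vectors = embed(questions).numpy()  # shape (len(questions), 512)
```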

erikbern commented 4 years ago

That makes sense. It would be great if there's already some kind of popular dataset that we can rely on – I don't want to make it too "arbitrary". If we have to build a custom pipeline then I'd love for it to be very trivial (like 10-20 lines at most to generate the data). Any thoughts?

jtibshirani commented 4 years ago

> It would be great if there's already some kind of popular dataset that we can rely on – I don't want to make it too "arbitrary"

I was also wondering about this point. I haven't come across a commonly used dataset of sentence embedding vectors, and I'd be curious whether others have ideas.

To add some context, the idea of using Stack Exchange was motivated by the fact that it has a friendly license and has formed the basis for some research on question retrieval and duplicate question detection (including the CQADupStack dataset). In a comparison of text embedding models, Universal Sentence Encoder showed the best performance on semantic similarity tasks. It's available in pre-trained form and can be downloaded from TensorFlow Hub. (It's worth noting that text embedding is an active area, and new models and versions seem to come out pretty often.)

erikbern commented 4 years ago

I'm definitely open to adding this, provided it can use as many standard datasets and algorithms as possible and can be reasonably concise.

https://github.com/erikbern/ann-benchmarks/blob/master/ann_benchmarks/datasets.py is where all the datasets are defined so far. Of course, most people can ignore that file and just download the HDF5 files.
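
For a sense of how concise the pipeline could be, here is a minimal sketch of writing embedded vectors in the HDF5 layout the existing dataset files use (train/test arrays plus brute-force ground truth). The function name and split sizes are illustrative, and the ground-truth metric should be double-checked against how datasets.py computes angular distance:

```python
# Sketch: write sentence vectors in the HDF5 layout the benchmark
# consumes ('train', 'test', 'neighbors', 'distances' plus a
# 'distance' attribute). Cosine distance is used here as a stand-in
# for the benchmark's angular metric.
import h5py
import numpy
from sklearn.neighbors import NearestNeighbors

def write_sentence_dataset(vectors, fn, test_size=1000, count=100):
    numpy.random.shuffle(vectors)
    train, test = vectors[test_size:], vectors[:test_size]
    # Exact neighbors via brute force serve as the recall baseline.
    nn = NearestNeighbors(n_neighbors=count, algorithm="brute", metric="cosine")
    nn.fit(train)
    distances, neighbors = nn.kneighbors(test)
    with h5py.File(fn, "w") as f:
        f.attrs["distance"] = "angular"
        f.create_dataset("train", data=train)
        f.create_dataset("test", data=test)
        f.create_dataset("neighbors", data=neighbors)
        f.create_dataset("distances", data=distances)
```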