Open · Matthieu-Tinycoaching opened this issue 3 years ago

Hi,

I would like to use sentence-transformers for semantic similarity. On the one hand, I have query sentences whose embeddings will be computed on the fly. On the other hand, I have corpus sentences whose embeddings will be pre-computed before the application launches. What are the best options for storing the IDs + pre-computed embeddings of the corpus sentences?

1) Is pickle a good solution, and up to how many sentences does it work well?
2) When is it necessary to use FAISS or a dedicated database?

Thanks!
Yes, pickle is fine.
The question is at what point exact search becomes too slow and you need ANN. On CPU, you can do exact search for up to roughly 100k - 500k entries. If you keep your corpus embeddings on a GPU, you can handle up to roughly 5M.
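A minimal sketch of that exact-search path, with pickle used to store IDs + embeddings together (the model name, file name, and sentences below are illustrative, not from this thread):

```python
import pickle

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Pre-compute once: store corpus IDs and embeddings side by side.
corpus_ids = ["doc-1", "doc-2", "doc-3"]
corpus_sentences = [
    "A man is eating food.",
    "A monkey is playing drums.",
    "A woman is riding a horse.",
]
corpus_embeddings = model.encode(corpus_sentences, convert_to_tensor=True)

with open("corpus_embeddings.pkl", "wb") as f:
    pickle.dump({"ids": corpus_ids, "embeddings": corpus_embeddings}, f)

# At query time: load the stored embeddings, then run exact (brute-force) search.
with open("corpus_embeddings.pkl", "rb") as f:
    stored = pickle.load(f)

query_embedding = model.encode("What is the man doing?", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, stored["embeddings"], top_k=2)[0]
for hit in hits:
    print(stored["ids"][hit["corpus_id"]], hit["score"])
```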
After that, using ANN makes sense: https://www.sbert.net/examples/applications/semantic-search/README.html#approximate-nearest-neighbor
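The linked example covers hnswlib among other libraries; here is a rough sketch of that ANN route, with the index parameters chosen as plausible defaults rather than tuned values:

```python
import hnswlib
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus_sentences = ["A man is eating food.", "A monkey is playing drums."]
corpus_embeddings = model.encode(corpus_sentences)  # numpy array, shape (n, dim)

# Build the HNSW index once, offline; "cosine" matches the similarity used above.
index = hnswlib.Index(space="cosine", dim=corpus_embeddings.shape[1])
index.init_index(max_elements=len(corpus_sentences), ef_construction=400, M=64)
index.add_items(corpus_embeddings, np.arange(len(corpus_sentences)))
index.save_index("corpus_hnswlib.index")

# At query time: ef trades recall against speed (must be >= k).
index.set_ef(50)
query_embedding = model.encode("What is the man doing?")
neighbor_ids, distances = index.knn_query(query_embedding, k=2)
```

The index can be built offline and shipped alongside the pickled IDs, so at serving time you only load it and run `knn_query`.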
Thanks @nreimers for your feedback.
@nreimers I have a follow-up question about running inference from a containerized Docker image on cloud services. What would be the most efficient way to make the pre-computed corpus embeddings available on each query request: calling a cloud database on each request, or loading the pickled embeddings?
Thanks for your feedback. This was easy to handle on a local computer, but for deployment to cloud services it is less obvious which way of loading the embeddings is fastest.
Hi @Matthieu-Tinycoaching, I think loading the pickle file would be the most efficient.
You could also think about deploying a vector search database like Elasticsearch, OpenSearch/OpenDistro, Vespa.ai, etc.
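For the containerized setup, the usual pattern is to load the pickle once when the process starts, not on each request. A sketch assuming a FastAPI service (the framework and the pickle layout from the sketch above are illustrative assumptions):

```python
import pickle

from fastapi import FastAPI
from sentence_transformers import SentenceTransformer, util

app = FastAPI()

# Loaded once per container start, not once per request: the model load and
# pickle read happen at import time, so each request only pays for encode + search.
model = SentenceTransformer("all-MiniLM-L6-v2")
with open("corpus_embeddings.pkl", "rb") as f:
    stored = pickle.load(f)  # {"ids": [...], "embeddings": tensor}

@app.get("/search")
def search(q: str, top_k: int = 5):
    query_embedding = model.encode(q, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, stored["embeddings"], top_k=top_k)[0]
    return [{"id": stored["ids"][h["corpus_id"]], "score": h["score"]} for h in hits]
```

With this layout, per-request latency is dominated by the query encoding and the search itself, which is exactly where exact search vs. ANN (discussed above) matters.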