UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Semantic search/cos_sim on billion-size dataset? #1459

Open khcy82dyc2 opened 2 years ago

khcy82dyc2 commented 2 years ago

I wonder what the most efficient way is to run semantic_search/cos_sim over one billion corpus embeddings, since CUDA/CPU memory will not fit them all at once? Apart from breaking the dataset up with a for loop.

nreimers commented 2 years ago

Do you need exact results => Then the only way is to break the dataset into chunks and merge the per-chunk results.
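A minimal sketch of that chunked exact search in plain NumPy (the `chunked_cosine_search` helper, chunk size, and toy data below are illustrative assumptions, not part of sentence-transformers; with memory-mapped embeddings the same loop scales to corpora that do not fit in RAM):

```python
import numpy as np

def chunked_cosine_search(query, corpus, top_k=5, chunk_size=100_000):
    """Exact top-k cosine search over a corpus too large to score at once.

    Scores the corpus in chunks and keeps a running top-k, so peak memory
    stays O(chunk_size) regardless of corpus size.
    """
    query = query / np.linalg.norm(query)
    best_scores = np.full(top_k, -np.inf, dtype=np.float32)
    best_ids = np.full(top_k, -1, dtype=np.int64)
    for start in range(0, len(corpus), chunk_size):
        chunk = corpus[start:start + chunk_size]
        # Normalize the chunk so the dot product equals cosine similarity.
        chunk = chunk / np.linalg.norm(chunk, axis=1, keepdims=True)
        scores = chunk @ query
        # Merge this chunk's scores into the running top-k.
        all_scores = np.concatenate([best_scores, scores])
        all_ids = np.concatenate([best_ids,
                                  np.arange(start, start + len(chunk))])
        order = np.argsort(-all_scores)[:top_k]
        best_scores, best_ids = all_scores[order], all_ids[order]
    return best_ids, best_scores

# Toy usage: 10k random vectors searched 2k at a time; querying with a
# corpus vector should return that vector itself as the best hit.
rng = np.random.default_rng(0)
corpus = rng.standard_normal((10_000, 64)).astype(np.float32)
ids, scores = chunked_cosine_search(corpus[1234], corpus,
                                    top_k=3, chunk_size=2_000)
```

The same pattern is what `util.semantic_search` does internally with its `corpus_chunk_size` parameter, except that here the corpus never needs to be resident in GPU memory at once.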

If approximate results are sufficient: https://www.sbert.net/examples/applications/semantic-search/README.html#approximate-nearest-neighbor
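The page above covers libraries such as Annoy, FAISS, and hnswlib. To make the trade-off concrete, here is a conceptual inverted-file (IVF-style) sketch in plain NumPy: partition the corpus into clusters at index time, then at query time score only the vectors in the few clusters whose centroids are closest to the query. All names and parameters (`n_clusters`, `n_probe`, `ann_search`) are illustrative, not any library's API:

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.standard_normal((20_000, 64)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# --- Index build: a few k-means steps, then an inverted list per cluster ---
n_clusters, n_probe = 64, 8
centroids = corpus[rng.choice(len(corpus), n_clusters, replace=False)].copy()
for _ in range(5):
    assign = np.argmax(corpus @ centroids.T, axis=1)
    for c in range(n_clusters):
        members = corpus[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
assign = np.argmax(corpus @ centroids.T, axis=1)
inverted = [np.flatnonzero(assign == c) for c in range(n_clusters)]

def ann_search(query, top_k=3):
    """Approximate search: score only the n_probe closest clusters."""
    query = query / np.linalg.norm(query)
    probe = np.argsort(-(centroids @ query))[:n_probe]
    candidates = np.concatenate([inverted[c] for c in probe])
    scores = corpus[candidates] @ query
    order = np.argsort(-scores)[:top_k]
    return candidates[order], scores[order]

# Querying with a corpus vector: only ~n_probe/n_clusters of the corpus is
# scored, yet the vector itself is still found (it lives in the top cluster).
ids, scores = ann_search(corpus[5678])
```

Real ANN libraries add far better index structures (HNSW graphs, product quantization) and tuned recall/speed knobs, but the core idea is the same: trade a small chance of missing the true nearest neighbor for scoring only a fraction of a billion-vector corpus.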