UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

Searching for exact keyword using sbert models #2238

Open ankitas3 opened 1 year ago

ankitas3 commented 1 year ago

I’m using one of the Hugging Face models, sentence-transformers/all-MiniLM-L6-v2, for semantic search. Currently I'm facing trouble when searching for exact keywords. This is needed when searching for things like:

- a person’s name: John Davis
- a specific id/number: 2023
- a keyword containing special characters: Legal-Compliance, Year’23, $200, Q&A

My documents are long (more than 500 words), so for embedding creation each document is split into an array of chunks of about 100 words each, which are encoded separately and then averaged.

import numpy as np

# embedding_tokens - array of text chunks for one document
chunk_embeddings = model.encode(embedding_tokens)
embedding = np.mean(chunk_embeddings, axis=0)

These embeddings are then stored and searched using OpenSearch, which is currently returning irrelevant or only loosely relevant results at the top.

Can someone help me with this? Is this the correct way to combine/average the embeddings? And how do I search for numbers and keywords with special characters here?
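For reference, the chunk-and-average step described above can be sketched end to end as follows. This is only an illustrative sketch: `split_into_chunks` and the `encode` parameter are hypothetical names, and in the real pipeline `encode` would be `SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2").encode`.

```python
import numpy as np

def split_into_chunks(text, chunk_words=100):
    """Split a long document into chunks of roughly `chunk_words` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words])
            for i in range(0, len(words), chunk_words)]

def document_embedding(text, encode, chunk_words=100):
    """Encode each chunk and mean-pool into a single document vector."""
    chunks = split_into_chunks(text, chunk_words)
    chunk_embeddings = np.asarray(encode(chunks))  # shape: (n_chunks, dim)
    return chunk_embeddings.mean(axis=0)          # shape: (dim,)
```

Note that mean-pooling chunk vectors is a common heuristic, but it dilutes rare tokens (names, ids, special characters) across the whole document, which is one reason exact-keyword queries rank poorly.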

carlesoctav commented 1 year ago

Sentence embeddings cannot search for exact keywords, since the search happens in a vector space rather than over the statistical properties of terms. If you want both semantic and keyword search, you can perform a hybrid search. However, if you are always searching for exact terms, it is better to use a classical algorithm such as BM25.
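One minimal way to sketch that hybrid idea: compute a sparse BM25 score per document, min-max normalize it together with the dense cosine scores, and take a weighted sum. Everything below is an illustrative stand-in (the corpus, the `alpha` weight, and the `dense_scores` list, which in practice would come from cosine similarity between the query embedding and document embeddings); production systems would typically use OpenSearch's built-in BM25 plus a k-NN/hybrid query instead.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 scores of one tokenized query against tokenized docs."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    df = Counter()                        # document frequency per term
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def min_max(xs):
    """Normalize scores to [0, 1] so sparse and dense are comparable."""
    lo, hi = min(xs), max(xs)
    return [0.0] * len(xs) if hi == lo else [(x - lo) / (hi - lo) for x in xs]

def hybrid_scores(query_tokens, docs_tokens, dense_scores, alpha=0.5):
    """Weighted fusion: alpha * BM25 + (1 - alpha) * dense similarity."""
    sparse = min_max(bm25_scores(query_tokens, docs_tokens))
    dense = min_max(dense_scores)
    return [alpha * s + (1 - alpha) * d for s, d in zip(sparse, dense)]
```

The BM25 side handles exact tokens like "John Davis" or "2023" that the embedding side blurs, while the dense side still contributes semantic matches; `alpha` tunes the trade-off.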

aishwaryabajaj-54 commented 1 year ago

Is there any workaround to make exact search work?