UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

A lot of noise in semantic search #174

Closed ironllamagirl closed 4 years ago

ironllamagirl commented 4 years ago

Hi. Thank you for this great package! I am trying to use the semantic search example to detect sentences belonging to specific topics. I translated the different topics into query sentences to use with the semantic search.

The problem is that I am getting a lot of noise in the results. Many of the 'matching' sentences have nothing to do with the query, yet they are still ranked at a smaller distance than genuinely related sentences. Is there a way to avoid this? The use case I am working on requires as few noise sentences as possible, ideally none. I tried tuning the distance threshold, but it is hard to guess which value works best.

My initial idea was to apply a simple keyword filter on the results of the semantic search to eliminate noise. However, since the keyword list can't be completely exhaustive, I am afraid of losing 'good' sentences that are semantically similar to the query but don't contain any of the keywords I choose.

Another potential way to do this is to train a model so that it can tell if a sentence belongs to a certain topic or not - I haven't tried this yet. Could you please share your opinion/suggestions about this? I would really appreciate it. Thanks!

nreimers commented 4 years ago

Hi @ironllamagirl Can you provide some more information? Which model did you try? What type of data do you have?

What you describe matches my own experience with sentence embeddings (I have tried all the common methods): you get a lot of noise.

You can characterize methods by their false positive and false negative rates:
- False positive: a dissimilar pair gets a high score and is falsely included in the top 10.
- False negative: a similar pair gets a low score and is not returned in the top 10 results.

TF-IDF / BM25 has a low false positive rate but a high false negative rate, i.e., the results it finds are often relevant, but it sadly misses a lot of relevant pairs.

Sentence embeddings have the opposite characteristic: a low false negative rate, i.e., they find the relevant pairs, but a high false positive rate, i.e., the top results often contain noise from non-relevant matches.

Currently I am evaluating different approaches for question-based semantic similarity search. I hope I can share some data and results from these experiments soon (~1 month).

If computationally feasible, I think the best approach is a two-step approach:
Step 1) Retrieval: retrieve the top-100 matches with BM25 and with sentence embedding search.
Step 2) Filtering: if possible, use BERT (or similar) to score every (query, candidate_i) pair and select the top-10 results.

For step 2 you would need some training data.
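For illustration, here is a minimal sketch of what step 1 could look like with this package, assuming a recent version that provides `util.semantic_search`; the corpus, query, and model choice are placeholders, and step 2 is only indicated as a comment (see the re-ranking discussion further down this thread):

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical corpus and query; replace with your own sentences and topic queries
corpus = [
    "The company cut its carbon emissions by 20 percent.",
    "Quarterly revenue grew in the last fiscal year.",
    "New recycling and waste reduction programs were announced.",
]
query = "How does the company handle environmental risks?"

# Step 1: retrieval with a bi-encoder (sentence embeddings)
bi_encoder = SentenceTransformer('bert-base-nli-mean-tokens')
corpus_embeddings = bi_encoder.encode(corpus, convert_to_tensor=True)
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)

# Take the top-100 candidates by cosine similarity; they will still contain noise
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=100)[0]
candidates = [corpus[hit['corpus_id']] for hit in hits]

# Step 2 (filtering) would now score every (query, candidate) pair with a model
# trained for that purpose and keep only the top-10 pairs.
```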

For retrieval with sentence embeddings, it could also make sense to train a model with triplet loss if you have some training data.

Best Nils Reimers

ironllamagirl commented 4 years ago

Hi @nreimers
Thank you for your response. The task I have at hand is to retrieve text about how companies deal with environmental issues/risks from different text data sources, and ideally to match the sentences to different environment-related issues. The sample of data I am working with so far consists of news articles about a few companies, some of which discuss environment-related issues. I started by using the bert-base-nli-mean-tokens model directly, just like in the semantic search example, and I got a lot of noise.

As an attempt to improve the results, I tried fine-tuning. I fine-tuned the 'bert-base-uncased' model on my custom data using the triplet loss method, following the anchor/positive/negative schema. I spent some time labeling sentences (related to the environment or not), then generated all non-repetitive pairs from the sentences labeled as 'related'; these pairs provide the 'anchor' and 'positive' sentences. I then used the 'unrelated' sentences as 'negative'. The results showed less noise, but noise still makes up a relatively big percentage of the results.
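For reference, the triplet fine-tuning described above looks roughly like the following with this package (the sentences are invented placeholders; a real run would use the full labeled data and more epochs):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Hypothetical labeled sentences
related = [
    "The company committed to cutting emissions by 30 percent.",
    "A new program reduces plastic waste across all plants.",
]
unrelated = [
    "Quarterly revenue rose above analyst expectations.",
    "The board appointed a new chief executive.",
]

# All non-repetitive (anchor, positive) pairs from the related sentences,
# combined with unrelated sentences as negatives
train_examples = []
for i, anchor in enumerate(related):
    for positive in related[i + 1:]:
        for negative in unrelated:
            train_examples.append(InputExample(texts=[anchor, positive, negative]))

# 'bert-base-uncased' with mean pooling, as in the fine-tuning described above
word_embedding = models.Transformer('bert-base-uncased')
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding, pooling])

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```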

I am actually not very familiar with the BM25 method for semantic search. Is its only objective to compute TF-IDF for documents? There don't seem to be many examples of this method online.

In step 1, I assume sentence embedding search is what was implemented in the semantic search example, am I right? So you suggest appending the results from both TF-IDF and sentence embedding search together and then doing the filtering? Could you please explain in more detail how the filtering is done using BERT? By 'score', do you mean computing the distance between query and candidate? Isn't that what the sentence embedding search is doing as well?

Thanks.

nreimers commented 4 years ago

Hi @ironllamagirl Here some papers you might find interesting:

https://arxiv.org/abs/1907.04780
https://arxiv.org/abs/2002.08909
https://openreview.net/forum?id=rkg-mA4FDr
https://arxiv.org/abs/1905.01969
https://arxiv.org/abs/1811.08008

Also a project that might be interesting for you: https://github.com/koursaros-ai/nboost

BM25: BM25 is similar to TF-IDF, but often works much better because it takes different document lengths into consideration. Elasticsearch (which I can highly recommend) uses BM25 to index and find documents.
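If you want to experiment with BM25 locally without setting up Elasticsearch, something along these lines might help; it assumes the third-party rank_bm25 package and uses invented example sentences:

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25 (third-party package)

corpus = [
    "The company cut its carbon emissions by 20 percent.",
    "Quarterly revenue grew in the last fiscal year.",
    "New recycling and waste reduction programs were announced.",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

query = "environmental issues and emissions"
tokenized_query = query.lower().split()

scores = bm25.get_scores(tokenized_query)          # one BM25 score per document
top_hits = bm25.get_top_n(tokenized_query, corpus, n=2)
print(top_hits)
```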

An approach that works really well, and which is also implemented in Nboost, is neural re-ranking.

The idea is that you have two phases: a retrieval phase and a re-ranking phase.

In the retrieval phase, you get for example 100 hits. You could split this into getting 50 hits with Elasticsearch (BM25) and 50 hits with semantic search using sentence embeddings, or you could just get all 100 hits with BM25 from Elasticsearch.

In the second step, you apply a more complex model: The re-ranker.

This re-ranker gets as input (query, hit1), (query, hit2), ..., (query, hit100). For each pair, it outputs a value between 0 and 1 indicating how relevant the pair is. Nboost uses a BERT model for this, previously trained on suitable data. It ships several pre-trained models, which should generalize quite well to other domains.

The final results are then the top-10 pairs that got the highest score from the re-ranker.
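In recent versions of this package, such a re-ranker can be sketched with the CrossEncoder class and one of the publicly available MS MARCO cross-encoder models (the model name, query, and candidate hits below are assumptions; Nboost ships its own pre-trained models):

```python
from sentence_transformers import CrossEncoder

# Hypothetical query and candidate hits from the retrieval phase
query = "How does the company address environmental risks?"
hits = [
    "The firm published a climate risk assessment.",
    "Shares rose after the quarterly earnings call.",
    "A new plan targets a 30 percent cut in emissions.",
]

# Pre-trained re-ranker (assumed model name); it scores each (query, hit) pair,
# where a higher score means a more relevant pair
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = reranker.predict([(query, hit) for hit in hits])

# Keep the pairs with the highest re-ranker score as the final results
reranked = sorted(zip(hits, scores), key=lambda pair: pair[1], reverse=True)
top_10 = reranked[:10]
```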

You can find more details on re-ranking here: https://arxiv.org/pdf/1901.04085.pdf

Best Nils Reimers

ironllamagirl commented 4 years ago

Hi Nils,

Thank you very much for these resources. Very helpful. I ended up applying BM25 as a second 're-ranker'. It reduced noise marginally, but sacrificed good sentences. I will possibly build a manual dataset for reranking in the future.

Thank you again! I'm closing the issue for now.

braaannigan commented 4 years ago

Hi @nreimers @ironllamagirl

Just want to say that I've built a semantic search engine using this wonderful package without re-ranking. I'm not able to release the code at the moment, but wanted to share some pointers:

keyuchen21 commented 1 year ago

@nreimers

Hi Nils, any update on your evaluation/experiments on different approaches for question-based semantic similarity search?

keyuchen21 commented 1 year ago

@ironllamagirl

So you first use sentence-transformers for similarity, then use BM25 for re-ranking? Would you mind sharing the steps or code?

Thanks!