UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0
15.17k stars 2.47k forks

Have you evaluated the performance of these embeddings vs. BM25 on a short-sentence ranking task? #22

Closed guotong1988 closed 5 years ago

guotong1988 commented 5 years ago

I have trained a BERT-base model on the Quora Question Pairs dataset, which is a text similarity task. BM25 reaches about 56% top-1 ranking accuracy on a test set of 40,000 sentences, but these sentence embeddings reach only about 50% top-1 ranking accuracy.

nreimers commented 5 years ago

Hi @guotong1988, how did you model the task? Was it modeled as a pairwise classification task (given the pairs from the test set, classify each pair as duplicate vs. not-duplicate)? Or did you model it as an information retrieval task: given a question, find in a corpus of e.g. 20k questions the one question that is a duplicate?

guotong1988 commented 5 years ago

Thank you for your reply.

I trained the model on the training data by modifying https://github.com/UKPLab/sentence-transformers/blob/master/examples/training_stsbenchmark_bert.py
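For context, the core of such a training setup looks roughly like this (a minimal sketch of the sentence-transformers training API, not the script's exact contents; the checkpoint name and training pairs are placeholders):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Build a BERT-base model with mean pooling on top.
word_embedding_model = models.Transformer("bert-base-uncased")
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Placeholder pairs; in the real setup these come from the Quora training split.
train_examples = [
    InputExample(texts=["How do I learn Python?",
                        "What is the best way to learn Python?"], label=1.0),
    InputExample(texts=["How do I learn Python?",
                        "How do I bake bread?"], label=0.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=100)
```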

I use the final embeddings for each test example.

I view it as an IR task: given a question from the 40,000 questions, retrieve the top 20 most similar questions from the same 40,000. The top 20 do not contain the given question itself.
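That retrieval step can be sketched as follows (the model name is illustrative, and util.semantic_search comes from a later sentence-transformers release, so treat it as an assumption):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("bert-base-nli-mean-tokens")

# In the real setup this list holds the 40,000 test questions.
questions = ["How do I learn Python?",
             "What is the best way to learn Python?",
             "How do I bake bread?"]

embeddings = model.encode(questions, convert_to_tensor=True)

# Retrieve the top 21 per question, then drop the question itself
# to keep the top 20.
hits = util.semantic_search(embeddings, embeddings, top_k=21)
top20 = [[h for h in row if h["corpus_id"] != i][:20]
         for i, row in enumerate(hits)]
```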

For more info see https://arxiv.org/abs/1908.08326 ; ignore the tree-related parts. (I removed the comparison to BM25, as the embedding result is below BM25, if I have made no mistake.)

nreimers commented 5 years ago

Sentence embedding methods (like Sentence-BERT) work better in my experience for "recall-oriented" tasks: if two sentences share no or only a few overlapping words, they can still produce a meaningful similarity score.

BM25 or tf-idf, on the other hand, is extremely strict and leads to high precision but low recall.

For information retrieval, where you have large corpora, this is actually quite a big advantage and really hard to beat with any sentence embedding method.
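To make the contrast concrete, here is a small sketch (the rank_bm25 package and the model name are my assumptions, not something used in this thread): a paraphrase with no lexical overlap gets a BM25 score of 0, while the embedding model can still rate it as similar.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = ["how do i buy a car", "how do i cook rice"]
query = "purchasing an automobile"

# No shared terms with either document -> BM25 scores are all 0.
bm25 = BM25Okapi([doc.split() for doc in corpus])
print(bm25.get_scores(query.split()))  # [0. 0.]

# The embedding model can still assign the paraphrase a high similarity.
model = SentenceTransformer("bert-base-nli-mean-tokens")
emb = model.encode([query] + corpus, convert_to_tensor=True)
print(util.cos_sim(emb[0], emb[1:]))  # first score clearly higher
```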

Assume sentence embeddings have these properties:

- False positive rate (a sentence pair gets a high similarity score even though it is not similar): 0.1%
- False negative rate (a sentence pair gets a low score even though it is similar): 0.5%

Assume BM25 has these properties:

- False positive rate: 0.01%
- False negative rate: 5%

If you work with small datasets, sentence embedding approaches are the better option, as you will not suffer a lot from the higher false positive rate.

In Information Retrieval, the higher false positive rate is really bad.

Assume you do retrieval from 10k sentences with sentence embeddings: a false positive rate of 0.1% leads to 0.001 × 10,000 = 10 false positives, and you maybe retrieve the one correct duplicate question. Your chance of getting the top-1 answer right is about 1 in 11.

Now assume you do the retrieval with BM25: the false positive rate of 0.01% leads to only 1 false positive, plus the one correct duplicate question. Your chance of getting the top-1 answer right is about 1 in 2.
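The same back-of-the-envelope calculation, as a quick sanity check (just the arithmetic from the rates above):

```python
corpus_size = 10_000

for name, fp_rate in [("sentence embeddings", 0.001), ("BM25", 0.0001)]:
    false_positives = fp_rate * corpus_size
    # The one true duplicate competes with the false positives for rank 1.
    p_top1 = 1 / (false_positives + 1)
    print(f"{name}: ~{false_positives:.0f} false positives, "
          f"P(correct at rank 1) = {p_top1:.2f}")
# sentence embeddings: ~10 false positives, P(correct at rank 1) = 0.09
# BM25: ~1 false positive, P(correct at rank 1) = 0.50
```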

So with sentence embedding methods it is worthwhile to look at false positive and false negative rates. Depending on the application, you want either a lower false positive rate or a lower false negative rate.

For IR, you usually want a low false positive rate. For the STS task, you want a low false negative rate.

I hope this is helpful.

guotong1988 commented 5 years ago

Thank you very much!!

gregor-ge commented 5 years ago

Hi, I worked with @nreimers on this and also did some experiments in similar directions. Hope I can be of some help as well.

1) Maybe try different similarity metrics, e.g. Manhattan distance or Euclidean distance. In my experience, the right metric can sometimes have a large impact on performance (see the sketch after this list).

2) I also experimented with IR on Quora and AskUbuntu, based on this paper: https://arxiv.org/abs/1811.08008. Similar to you, the task is: given all questions, find the duplicates. I used the pretrained models bert-large-nli-stsb-mean-tokens and bert-large-nli-mean-tokens and compared them against tf-idf (but not BM25). bert-large-nli-stsb-mean-tokens achieved a MAP@100 of 74% on Quora and 17% on AskUbuntu; bert-large-nli-mean-tokens reached 61% and 11%; tf-idf managed 58% and 27%.
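Regarding point 1), a minimal sketch of comparing metrics on the same pair of embeddings (the sentence pair is illustrative; any of the pretrained models above works):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-large-nli-stsb-mean-tokens")
a, b = model.encode(["How do I reset my password?",
                     "What is the way to change my password?"])

# Cosine: higher = more similar; the two distances: lower = more similar.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)
manhattan = np.abs(a - b).sum()
print(f"cosine={cosine:.3f}  euclidean={euclidean:.2f}  manhattan={manhattan:.2f}")
```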

You could try one of the pretrained models, or train additionally on some other datasets, or fine-tune a pretrained model on Quora.

guotong1988 commented 5 years ago

Thank you very much!

mchari commented 4 years ago

@nreimers, thanks for the insight above. Maybe that is why a combination of tf-idf and embedding-based methods, tuned for one's use case, is required?

nreimers commented 4 years ago

@mchari Yes, combining embedding-based approaches with BM25 brings nice improvements. See: https://arxiv.org/pdf/2004.13969.pdf
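A minimal sketch of one such combination, assuming the rank_bm25 package and a simple min-max normalization with a weighted sum (the linked paper uses more elaborate schemes):

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = ["how to reset a password",
          "best pasta recipes for beginners",
          "i forgot my login credentials"]
query = "cannot remember my password"

# Lexical signal.
bm25 = BM25Okapi([doc.split() for doc in corpus])
bm25_scores = np.array(bm25.get_scores(query.split()))

# Semantic signal.
model = SentenceTransformer("bert-base-nli-mean-tokens")
emb_scores = util.cos_sim(model.encode(query), model.encode(corpus)).numpy().ravel()

def minmax(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

alpha = 0.5  # relative weight of BM25 vs. embeddings; tune per use case
hybrid = alpha * minmax(bm25_scores) + (1 - alpha) * minmax(emb_scores)
print(corpus[int(hybrid.argmax())])
```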