beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use: evaluate your models across 15+ diverse IR datasets.
http://beir.ai
Apache License 2.0

Rerank scores lower than vanilla dense IR? #54

Open pablogranolabar opened 2 years ago

pablogranolabar commented 2 years ago

Hi,

I've got a dense IR pipeline with reranking running for a search engine application. However, my rerank scores are lower than those of a plain dense IR run.

Bi-encoder (dense retrieval): msmarco-distilbert-base-v3
Cross-encoder (reranking): ms-marco-electra-base
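
Roughly, the pipeline looks like the following (a simplified sketch using BEIR's dense retrieval and reranking wrappers; loading of my custom dataset and the final evaluation calls are left out):

```python
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES
from beir.reranking.models import CrossEncoder
from beir.reranking import Rerank

# corpus, queries, qrels are loaded from my custom dataset
# (same dictionary format as BEIR's GenericDataLoader)

# Stage 1: dense retrieval with the bi-encoder
model = DRES(models.SentenceBERT("msmarco-distilbert-base-v3"), batch_size=128)
retriever = EvaluateRetrieval(model, score_function="cos_sim")
results = retriever.retrieve(corpus, queries)

# Stage 2: rerank the top-100 hits per query with the cross-encoder
reranker = Rerank(CrossEncoder("cross-encoder/ms-marco-electra-base"), batch_size=128)
rerank_results = reranker.rerank(corpus, queries, results, top_k=100)
```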

Scores:

Metric          Dense IR    Dense IR + Re-Rank
NDCG@1          0.3629      0.3538
NDCG@3          0.5234      0.5170
NDCG@5          0.5472      0.5401
NDCG@10         0.5623      0.5540
NDCG@100        0.5879      0.5812
NDCG@1000       0.5965      0.5812

MAP@1           0.3629      0.3538
MAP@3           0.4844      0.4774
MAP@5           0.4977      0.4903
MAP@10          0.5040      0.4961
MAP@100         0.5090      0.5013
MAP@1000        0.5093      0.5013

Recall@1        0.3629      0.3538
Recall@3        0.6362      0.6315
Recall@5        0.6932      0.6869
Recall@10       0.7397      0.7297
Recall@100      0.8627      0.8618
Recall@1000     0.9310      0.8618

P@1             0.3629      0.3538
P@3             0.2121      0.2105
P@5             0.1386      0.1374
P@10            0.0740      0.0730
P@100           0.0086      0.0086
P@1000          0.0009      0.0009

Any thoughts would be greatly appreciated.

thakur-nandan commented 2 years ago

Hi @pablogranolabar,

This is indeed strange. Thanks for sharing these values.

  1. Could you share which dataset you found these numbers on? Is it a custom dataset of yours?
  2. How many documents did you rerank after retrieving with msmarco-distilbert-base-v3?

Also, could you try the ms-marco-MiniLM-L-6-v2 model (https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2)? It is a stronger model than ms-marco-electra-base.
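
For reference, the swap plus a side-by-side evaluation could look roughly like this (a sketch that assumes corpus, queries, qrels, the first-stage results, and the retriever object from your dense run are already available):

```python
from beir.reranking.models import CrossEncoder
from beir.reranking import Rerank

# Rerank the same first-stage results with MiniLM-L-6-v2 instead of electra-base
reranker = Rerank(CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2"), batch_size=128)
rerank_results = reranker.rerank(corpus, queries, results, top_k=100)

# Evaluate both runs with the same k values so the numbers are directly comparable
ndcg_dense, map_dense, recall_dense, p_dense = retriever.evaluate(qrels, results, retriever.k_values)
ndcg_ce, map_ce, recall_ce, p_ce = retriever.evaluate(qrels, rerank_results, retriever.k_values)
```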

Kind Regards, Nandan Thakur

pablogranolabar commented 2 years ago

Hi @NThakur20, thanks for making your work available and for the speedy reply!

Yes, this is a custom dataset: a collection of search engine queries, such as returning company information for a ticker.

For reranking, I used the default, which I believe is 100 documents.

I will check out MiniLM next, thanks for the help!

pablogranolabar commented 2 years ago

Hi again @NThakur20, I swapped the cross-encoder for ms-marco-MiniLM-L-6-v2, but I am still getting subpar re-rank scores compared to the dense IR run. Any thoughts?

thakur-nandan commented 2 years ago

Hi @pablogranolabar,

Could you manually evaluate the top-k documents (let's say for k=10) and check whether the results are as expected? One possible reason could be how the test data was annotated.

Could you also share a snippet of your pseudocode so I can check whether everything is working as expected?
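
Something along these lines is what I mean by a manual check (a rough sketch; it assumes rerank_results is the {query_id: {doc_id: score}} dictionary returned by the reranker and qrels holds your relevance judgements):

```python
# Inspect the top-10 reranked documents for a handful of queries,
# together with their gold relevance labels from qrels.
k = 10
for query_id, query_text in list(queries.items())[:5]:
    ranked = sorted(rerank_results[query_id].items(), key=lambda x: x[1], reverse=True)[:k]
    print(f"\nQuery: {query_text}")
    for rank, (doc_id, score) in enumerate(ranked, start=1):
        gold = qrels.get(query_id, {}).get(doc_id, 0)
        title = corpus[doc_id].get("title", "")
        print(f"{rank:2d}. score={score:.4f}  gold={gold}  doc={doc_id}  {title[:80]}")
```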

Kind Regards, Nandan Thakur

pablogranolabar commented 2 years ago

Hi @NThakur20, yes, I've experimented with lower k values all the way down to 10, as well as varying batch sizes. The results are pretty much the same: rerank scores are lower across the board than dense IR. The dataset is pretty small though, just about 13K search queries and their anticipated results. Do you think that could be a large factor here?

And how important would hyperparameter optimization be in this scenario? I've been thinking about putting together an RL environment for that to increase precision, which is low, while recall and the other two scores are consistently high.

thakur-nandan commented 2 years ago

Hi @pablogranolabar, maybe try lexical retrieval with Elasticsearch as the first stage and then rerank the top-k using the above-mentioned cross-encoder?

In our publication, we found the lexical retrieval (BM25) + CE reranking combination to work well.
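
A minimal sketch of that setup with BEIR's BM25/Elasticsearch wrapper (it assumes an Elasticsearch instance running on localhost and already-loaded corpus, queries, and qrels; the index name is just a placeholder):

```python
from beir.retrieval.search.lexical import BM25Search as BM25
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.reranking.models import CrossEncoder
from beir.reranking import Rerank

# Stage 1: BM25 retrieval via Elasticsearch
bm25 = BM25(index_name="my-custom-index", hostname="localhost", initialize=True)
retriever = EvaluateRetrieval(bm25)
results = retriever.retrieve(corpus, queries)

# Stage 2: cross-encoder reranking of the BM25 top-100
reranker = Rerank(CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2"), batch_size=128)
rerank_results = reranker.rerank(corpus, queries, results, top_k=100)

# Same evaluation as before
ndcg, _map, recall, precision = retriever.evaluate(qrels, rerank_results, retriever.k_values)
```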

cramraj8 commented 1 year ago

@thakur-nandan I am experimenting with BM25 + CE on TREC-NEWS, TREC-COVID, and NQ. However, for TREC-COVID I am getting lower re-ranking performance than the BM25 scores when using ms-marco-MiniLM-L-6-v2 as a zero-shot re-ranker. Do I have to fine-tune it again? Does the BM25+CE column in your paper's results table report scores after fine-tuning MiniLM, or zero-shot performance?

cramraj8 commented 1 year ago

I just realized that after combining title + text into a single multi-field passage and re-ranking, I was able to reproduce the scores reported in the paper.
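
For anyone hitting the same issue: the fix amounts to scoring (query, title + text) pairs instead of (query, text) alone. A small illustrative sketch using the sentence-transformers CrossEncoder directly (the helper name and its inputs are hypothetical):

```python
from sentence_transformers import CrossEncoder

ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_with_title(query, doc_ids, corpus, top_k=100):
    # Concatenate title and text into one passage per candidate document
    pairs = [(query, (corpus[d].get("title", "") + " " + corpus[d].get("text", "")).strip())
             for d in doc_ids[:top_k]]
    scores = ce.predict(pairs)
    # Highest cross-encoder score first
    return sorted(zip(doc_ids[:top_k], scores), key=lambda item: item[1], reverse=True)
```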