beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use: evaluate your models across 15+ diverse IR datasets.
http://beir.ai

Fix `DenseRetrievalExactSearch` evaluation #154

Open NouamaneTazi opened 11 months ago

NouamaneTazi commented 11 months ago

I noticed a problem in the way we handle queries that also exist in the retrieval corpus. By default we have `ignore_identical_ids=True`, which pops these self-matches from the results after retrieval. This means some queries end up with `top_k` retrieved documents while others only have `top_k - 1`.
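The fix is to skip the self-match while building the result list, instead of popping it from the finished results. Below is a minimal sketch of the idea; the function name and tensor layout are illustrative assumptions, not the actual `DenseRetrievalExactSearch` code:

    import torch

    def top_k_without_self(scores: torch.Tensor, query_ids, corpus_ids, top_k: int):
        """Illustrative sketch: scores is a (num_queries, num_docs) similarity matrix."""
        results = {}
        # Retrieve one extra candidate so that dropping a self-match
        # still leaves exactly top_k documents for every query.
        values, indices = torch.topk(scores, min(top_k + 1, scores.size(1)), dim=1)
        for qi, qid in enumerate(query_ids):
            hits = {}
            for score, ci in zip(values[qi].tolist(), indices[qi].tolist()):
                cid = corpus_ids[ci]
                if cid == qid:  # skip the query's own document
                    continue
                hits[cid] = score
                if len(hits) == top_k:
                    break
            results[qid] = hits
        return results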

Fixing this behaviour gives a noticeable change in scores. Here is the difference for "intfloat/e5-large" on ArguAna, evaluated with MTEB:

    # Reproduction: evaluate intfloat/e5-large on ArguAna with MTEB
    from mteb import MTEB
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("intfloat/e5-large", device="cuda")
    eval = MTEB(tasks=["ArguAna"])
    eval.run(model, batch_size=512 * 2, corpus_chunk_size=10000, overwrite_results=True)

Scores before fix:

    INFO:mteb.evaluation.MTEB:Scores: {'ndcg_at_1': 0.27596, 'ndcg_at_3': 0.42701, 'ndcg_at_5': 0.48151, 'ndcg_at_10': 0.53452, 'ndcg_at_100': 0.57081, 'ndcg_at_1000': 0.57226, 'map_at_1': 0.27596, 'map_at_3': 0.38976, 'map_at_5': 0.41967, 'map_at_10': 0.44187, 'map_at_100': 0.4507, 'map_at_1000': 0.45077, 'recall_at_1': 0.27596, 'recall_at_3': 0.53485, 'recall_at_5': 0.66856, 'recall_at_10': 0.83073, 'recall_at_100': 0.98578, 'recall_at_1000': 0.99644, 'precision_at_1': 0.27596, 'precision_at_3': 0.17828, 'precision_at_5': 0.13371, 'precision_at_10': 0.08307, 'precision_at_100': 0.00986, 'precision_at_1000': 0.001, 'mrr_at_1': 0.28378, 'mrr_at_3': 0.39284, 'mrr_at_5': 0.42261, 'mrr_at_10': 0.44498, 'mrr_at_100': 0.45374, 'mrr_at_1000': 0.45381, 'evaluation_time': 127.59}

Scores after fix:

    INFO:mteb.evaluation.MTEB:Scores: {'ndcg_at_1': 0.41963, 'ndcg_at_3': 0.57859, 'ndcg_at_5': 0.62677, 'ndcg_at_10': 0.65648, 'ndcg_at_100': 0.67739, 'ndcg_at_1000': 0.67846, 'map_at_1': 0.41963, 'map_at_3': 0.53983, 'map_at_5': 0.56664, 'map_at_10': 0.57907, 'map_at_100': 0.58407, 'map_at_1000': 0.58413, 'recall_at_1': 0.41963, 'recall_at_3': 0.69061, 'recall_at_5': 0.80725, 'recall_at_10': 0.89829, 'recall_at_100': 0.98862, 'recall_at_1000': 0.99644, 'precision_at_1': 0.41963, 'precision_at_3': 0.2302, 'precision_at_5': 0.16145, 'precision_at_10': 0.08983, 'precision_at_100': 0.00989, 'precision_at_1000': 0.001, 'mrr_at_1': 0.41963, 'mrr_at_3': 0.53983, 'mrr_at_5': 0.56664, 'mrr_at_10': 0.57907, 'mrr_at_100': 0.58407, 'mrr_at_1000': 0.58413, 'evaluation_time': 112.69}
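A quick way to see the off-by-one before the fix is to count the retrieved documents per query (a sketch; `results` is assumed to be the `{query_id: {doc_id: score}}` dict returned by the retriever):

    from collections import Counter

    # Before the fix, queries that also appear in the corpus
    # end up with top_k - 1 documents instead of top_k.
    sizes = Counter(len(docs) for docs in results.values())
    print(sizes)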

cc @thakur-nandan