beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use: evaluate your models across 15+ diverse IR datasets.
http://beir.ai
Apache License 2.0

Is re_ranking implementation wrong? #106

wasiahmad opened this issue 1 year ago (status: Open)

wasiahmad commented 1 year ago

Either my understanding is wrong or the re-ranking implementation is wrong. EvaluateRetrieval.rerank looks like this:

def rerank(self, 
        corpus: Dict[str, Dict[str, str]], 
        queries: Dict[str, str],
        results: Dict[str, Dict[str, float]],
        top_k: int) -> Dict[str, Dict[str, float]]:

    new_corpus = {}

    for query_id in results:
        if len(results[query_id]) > top_k:
            for (doc_id, _) in sorted(results[query_id].items(), key=lambda item: item[1], reverse=True)[:top_k]:
                new_corpus[doc_id] = corpus[doc_id]
        else:
            for doc_id in results[query_id]:
                new_corpus[doc_id] = corpus[doc_id]

    return self.retriever.search(new_corpus, queries, top_k, self.score_function)

By re-ranking we mean: given the top_k (e.g., 100) retrieved documents for each query, a re-ranker outputs scores only for those top_k documents, which are then sorted by the new scores. According to the above implementation, however, each query is re-ranked against a pool that may contain up to top_k * len(queries) documents. Am I missing something?
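To make the expected behaviour concrete, here is a minimal sketch (hypothetical code, not the BEIR API; score_fn stands for whatever model scores a query-document pair) of reranking restricted to each query's own candidates:

from typing import Callable, Dict

def rerank_per_query(
        corpus: Dict[str, Dict[str, str]],
        queries: Dict[str, str],
        results: Dict[str, Dict[str, float]],
        top_k: int,
        score_fn: Callable[[str, Dict[str, str]], float]) -> Dict[str, Dict[str, float]]:
    reranked = {}
    for query_id, doc_scores in results.items():
        # Keep only this query's own top_k first-stage candidates ...
        candidates = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)[:top_k]
        # ... and score the query against those documents only.
        reranked[query_id] = {doc_id: score_fn(queries[query_id], corpus[doc_id])
                              for doc_id, _ in candidates}
    return reranked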

@thakur-nandan need your help.

tobicoveo commented 1 year ago

I agree, I noticed the same thing. The search seems to be done against the same pool of documents for every query, where the pool is the union of the documents retrieved for all queries.

The search method appears to be completely agnostic to which doc_ids are associated with a given query, and no filtering is applied to the documents returned by search to keep only the doc_ids from that query's own first-stage retrieval.
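As a sketch of the kind of filtering I mean (a hypothetical helper, not part of BEIR), keeping only each query's own first-stage doc_ids after the call to search:

from typing import Dict

def filter_to_first_stage(reranked: Dict[str, Dict[str, float]],
                          first_stage: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    # For each query, drop any doc_id that was not retrieved for that
    # query in the first stage.
    filtered = {}
    for query_id, doc_scores in reranked.items():
        allowed = first_stage.get(query_id, {})
        filtered[query_id] = {doc_id: score for doc_id, score in doc_scores.items()
                              if doc_id in allowed}
    return filtered

Even with this filter, the result is not equivalent to true per-query reranking: documents pooled from other queries can push a query's own candidates out of the top_k returned by search before the filter is applied.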

Muennighoff commented 1 year ago

This is pretty important, as it would mean the BM25 + reranking results overstate the actual performance.

nreimers commented 1 year ago

Have a look at this code example: https://github.com/beir-cellar/beir/blob/c3334fd5b336dba03c5e3e605a82fcfb1bdf667d/examples/retrieval/evaluation/reranking/evaluate_bm25_ce_reranking.py#L63

It doesn't use the EvaluateRetrieval.rerank method; instead, it uses the Rerank.rerank method.
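From memory, the relevant part of that example looks roughly like this (treat the model name and batch size as illustrative; corpus, queries and results come from the first-stage BM25 retrieval earlier in the script):

from beir.reranking.models import CrossEncoder
from beir.reranking import Rerank

cross_encoder_model = CrossEncoder("cross-encoder/ms-marco-electra-base")
reranker = Rerank(cross_encoder_model, batch_size=128)

# Rerank.rerank scores each query only against its own top_k candidates.
rerank_results = reranker.rerank(corpus, queries, results, top_k=100)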

Muennighoff commented 1 year ago

> Have a look at this code example:
>
> https://github.com/beir-cellar/beir/blob/c3334fd5b336dba03c5e3e605a82fcfb1bdf667d/examples/retrieval/evaluation/reranking/evaluate_bm25_ce_reranking.py#L63
>
> It doesn't use the EvaluateRetrieval.rerank method; instead, it uses the Rerank.rerank method.

Ah great, thanks! 👍

tobicoveo commented 1 year ago

That covers the cross-encoder example, but the example that uses reranking with a dual-encoder (sentence-bert) still relies on EvaluateRetrieval.rerank, so it is still affected. The same goes for any custom implementation that uses the EvaluateRetrieval.rerank method.


https://github.com/beir-cellar/beir/blob/main/examples/retrieval/evaluation/reranking/evaluate_bm25_sbert_reranking.py#L44