beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
http://beir.ai
Apache License 2.0

Why do we need to check corpus_id != query_id in DenseRetrievalExactSearch.search()? #169

Open mengyao00 opened 5 months ago

mengyao00 commented 5 months ago

Why do we need this line, which checks corpus_id != query_id?

For a query with ID id_q, a corpus document that has the same ID id_q is not necessarily a positive (relevant) document for that query. So why do we need to exclude the case corpus_id == query_id?

            for query_itr in range(len(query_embeddings)):
                query_id = query_ids[query_itr]                  
                for sub_corpus_id, score in zip(cos_scores_top_k_idx[query_itr], cos_scores_top_k_values[query_itr]):
                    corpus_id = corpus_ids[corpus_start_idx+sub_corpus_id]
                    if corpus_id != query_id:
                        if len(result_heaps[query_id]) < top_k:
                            # Push item on the heap
                            heapq.heappush(result_heaps[query_id], (score, corpus_id))
                        else:
                            # If item is larger than the smallest in the heap, push it on the heap then pop the smallest element
                            heapq.heappushpop(result_heaps[query_id], (score, corpus_id))

        for qid in result_heaps:
            for score, corpus_id in result_heaps[qid]:
                self.results[qid][corpus_id] = score

        return self.results

thakur-nandan commented 5 months ago

Hi @mengyao00, thanks for asking the question.

We need this line for two datasets, ArguAna and Quora, where corpus_ids and query_ids overlap, i.e., the query is also present within the corpus.

The check avoids the edge case of self-retrieval, where the query retrieves itself at the top-1 position, which reduces the nDCG@10 score for ArguAna and Quora.

Hope it helps!

Regards, Nandan Thakur
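
To make the self-retrieval effect described above concrete, here is a minimal, self-contained sketch. It is not BEIR's implementation: the helper names (`top_k_without_self`, `ndcg_at_k`), the toy IDs and scores, and the simplified binary-relevance nDCG are all assumptions made for illustration. It keeps the top-k results in a min-heap, as in the snippet from the question, and compares the score with and without the `corpus_id != query_id` check.

```python
import heapq
import math


def top_k_without_self(query_id, scored_docs, top_k, skip_self=True):
    """Return the top_k doc IDs by score (hypothetical helper, not BEIR API)."""
    heap = []  # min-heap of (score, doc_id); the smallest score sits at heap[0]
    for doc_id, score in scored_docs:
        if skip_self and doc_id == query_id:
            continue  # skip the query's own copy in the corpus
        if len(heap) < top_k:
            heapq.heappush(heap, (score, doc_id))
        else:
            # Push the new item, then pop the smallest, so the heap keeps top_k
            heapq.heappushpop(heap, (score, doc_id))
    # Order the surviving items from highest to lowest score
    return [doc_id for _, doc_id in sorted(heap, reverse=True)]


def ndcg_at_k(ranked_doc_ids, relevant, k):
    """Simplified nDCG@k with binary relevance (an assumption for this sketch)."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_doc_ids[:k])
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy data: the query "q1" also exists in the corpus under the same ID,
# and its own copy is the nearest neighbour. Only "d7" is judged relevant.
query_id = "q1"
scored_docs = [("q1", 1.0), ("d7", 0.9), ("d3", 0.5)]
relevant = {"d7"}

filtered = top_k_without_self(query_id, scored_docs, top_k=10)
unfiltered = top_k_without_self(query_id, scored_docs, top_k=10, skip_self=False)

ndcg_filtered = ndcg_at_k(filtered, relevant, k=10)
ndcg_unfiltered = ndcg_at_k(unfiltered, relevant, k=10)

print(filtered, round(ndcg_filtered, 3))      # ['d7', 'd3'] 1.0
print(unfiltered, round(ndcg_unfiltered, 3))  # ['q1', 'd7', 'd3'] 0.631
```

In this toy case the query's own copy takes rank 1 when the check is disabled. Because it is not listed as relevant, the only relevant document drops to rank 2 and the score falls from 1.0 to about 0.631. For datasets where query IDs and corpus IDs never coincide, the check does not change the results.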