beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
http://beir.ai
Apache License 2.0

Can't understand a certain line #37

Closed ji-xin closed 3 years ago

ji-xin commented 3 years ago

https://github.com/UKPLab/beir/blob/933b349bf300718cd6a2d285c51fe78f48fdec85/beir/retrieval/search/dense/exact_search.py#L77

I couldn't understand what this line is trying to do. corpus_id and query_id come from completely different groups, so it should be fine for them to be the same, right? Removing this if statement has a huge impact on the nDCG score (tested with ANCE@arguana).

thakur-nandan commented 3 years ago

Hi @ji-xin,

In a few corpora, such as Quora and ArguAna, queries and documents are of the same kind (both passages or both sentences). For example, in Quora both the query and the document are single-sentence questions, and in ArguAna both are passages. So when you retrieve for a question in Quora (e.g. q1: how many states are present in India?), the search returns the identical input question itself as the most similar document (i.e. q1 is returned as the top result), because the input query is present within the corpus!
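To make the effect concrete, here is a minimal sketch of a query retrieving itself. It assumes cosine similarity as the scoring function; the three vectors and the ids `q1`, `d2`, `d3` are made-up toy values, not embeddings from any model or entries from a BEIR dataset.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy corpus (hypothetical vectors). The entry "q1" is also used as the query,
# mimicking a dataset where the query text is itself part of the corpus.
corpus = {
    "q1": [0.9, 0.1, 0.3],
    "d2": [0.7, 0.4, 0.2],
    "d3": [0.1, 0.9, 0.5],
}
query_id = "q1"
query_vec = corpus[query_id]

# Rank every corpus entry by similarity to the query, best first.
ranked = sorted(corpus, key=lambda doc_id: cosine(query_vec, corpus[doc_id]), reverse=True)
print(ranked)
```

Because a vector is maximally similar to itself, `q1` lands at rank 1 and pushes every genuinely relevant document down by one position.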

To avoid this, we check whether the returned result (corpus_id) has the same id as the query (query_id); if so, we explicitly remove that document from the returned results. This has no effect on the other datasets, where the query is a sentence while the documents are passages.
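The check described above can be sketched as a small post-filter. This is an illustration only, not the code in `exact_search.py`: the function name `drop_self_matches` and the nested-dict layout (`query_id -> {corpus_id: score}`) are assumptions chosen for the example.

```python
from typing import Dict

Results = Dict[str, Dict[str, float]]

def drop_self_matches(results: Results) -> Results:
    """Return a copy of `results` without hits whose corpus id equals the query id.

    Hypothetical helper for illustration; the layout query_id -> {corpus_id: score}
    is an assumption of this sketch.
    """
    return {
        query_id: {
            corpus_id: score
            for corpus_id, score in hits.items()
            if corpus_id != query_id  # skip the query retrieving itself
        }
        for query_id, hits in results.items()
    }

# Toy scores (made up): "q1" retrieved itself, "q2" did not.
results = {
    "q1": {"q1": 1.0, "d7": 0.83, "d2": 0.41},
    "q2": {"d2": 0.77, "d9": 0.12},
}
filtered = drop_self_matches(results)
print(filtered)
```

Only the self-match for `q1` is dropped; queries whose ids never appear among their own hits, such as `q2` here, pass through unchanged.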

Kind Regards, Nandan Thakur

ji-xin commented 3 years ago

Thanks for the response! Sounds like a reasonable move.