Closed ji-xin closed 3 years ago
Hi @ji-xin,
In a few corpora, such as Quora and Arguana, the queries and documents are of the same kind (both passages or both sentences). For example, in Quora both query and document are single-sentence questions, and in Arguana both are passages. When you retrieve for a question in Quora (e.g. q1: "How many states are present in India?"), the search returns the identical input question as the most similar document (i.e. q1 is returned as the top result), because the input query is itself present in the corpus!
To avoid this situation, we check whether the returned result (`corpus_id`) has the same id as the query (`query_id`). If it does, we explicitly remove that document from the returned results. This has no effect on the other datasets, where the query is a sentence and the documents are passages.
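As a rough illustration of the check described above (a minimal sketch, not the code in `exact_search.py`): the function name `drop_self_matches` and the `{query_id: {corpus_id: score}}` layout of `results` are assumptions made for this example.

```python
from typing import Dict

# Hypothetical layout: results[query_id][corpus_id] = similarity score
Results = Dict[str, Dict[str, float]]


def drop_self_matches(results: Results) -> Results:
    """Return a copy of `results` without hits whose corpus_id equals the query_id."""
    return {
        query_id: {
            corpus_id: score
            for corpus_id, score in hits.items()
            if corpus_id != query_id  # skip the query retrieving itself
        }
        for query_id, hits in results.items()
    }


# Toy data: q1 and q2 each retrieve themselves with a near-perfect score.
results = {
    "q1": {"q1": 1.00, "q7": 0.83, "q9": 0.41},
    "q2": {"q5": 0.77, "q2": 0.99},
}
filtered = drop_self_matches(results)
```

On datasets where query ids and corpus ids never coincide, the filter leaves the results unchanged.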
Kind Regards, Nandan Thakur
Thanks for the response! Sounds like a reasonable move.
https://github.com/UKPLab/beir/blob/933b349bf300718cd6a2d285c51fe78f48fdec85/beir/retrieval/search/dense/exact_search.py#L77
I couldn't understand what this line is trying to do...
`corpus_id` and `query_id` are from completely different groups and it's fine that they are the same, right? Removing this `if` statement has a huge impact on the nDCG score (tested with ANCE@arguana).