beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
http://beir.ai
Apache License 2.0

Why index the queries together with the corpus? #72

Closed thigm85 closed 2 years ago

thigm85 commented 2 years ago

I noticed that some datasets (e.g., ArguAna) include the queries in the documents fed to the search engine. In addition, the evaluation code ignores the scores of the indexed queries. Is there a reason for this choice?

thakur-nandan commented 2 years ago

Hi @thigm85, yes, this is the case for the ArguAna and Quora datasets. It is due to the inherent nature of these tasks and how the datasets were originally created.

The ArguAna task involves retrieving a counterargument (of passage length) for an input argument (also of passage length). Similarly, the Quora task involves retrieving a duplicate question (usually of sentence length) for an input question (also of sentence length). So for both of these datasets, a query (i.e., a passage for ArguAna and a question for Quora) can be a possible answer for a different query in the dataset, and hence queries and corpus items can be used interchangeably.

The reason for ignoring the scores is that, for an input query, you might retrieve that very query, i.e., the same query present within the collection. So we go ahead and remove these self-retrieved queries from the results.
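The filtering step described above can be sketched as follows. This is a minimal illustration, not BEIR's actual evaluation code; it assumes the BEIR-style results shape `{query_id: {doc_id: score}}` and that a self-retrieved query appears in the corpus under the same ID as the query itself:

```python
def drop_identical_ids(results):
    """Remove hits whose doc_id equals the query_id (self-retrieval).

    `results` maps each query ID to a dict of {doc_id: score},
    the shape BEIR retrievers produce.
    """
    return {
        qid: {did: score for did, score in hits.items() if did != qid}
        for qid, hits in results.items()
    }


# Example: "q1" retrieved itself with the top score.
results = {
    "q1": {"q1": 9.8, "d7": 7.2, "d3": 5.1},
    "q2": {"d4": 6.0},
}
filtered = drop_identical_ids(results)
assert "q1" not in filtered["q1"]          # self-hit removed
assert filtered["q2"] == {"d4": 6.0}       # other queries untouched
```

After this filtering, standard metrics (nDCG, recall, etc.) can be computed without the trivially perfect self-match inflating the scores.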

Apart from these two datasets, you would not find this happening elsewhere, as the queries are usually of sentence length whereas the corpus contains documents of passage length, so the two cannot be used interchangeably.

Kind Regards, Nandan Thakur

thigm85 commented 2 years ago

Thanks for the context @NThakur20.