beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
http://beir.ai
Apache License 2.0

Inconsistency in Arguana #101

Open claeyzre opened 2 years ago

claeyzre commented 2 years ago

First, let me thank you for the huge work of putting this benchmark together.

While downloading and processing the dataset, I came across something odd in the Arguana dataset.

The id test-free-speech-debate-yfsdfkhbwu-con03b is listed as a relevant passage in the qrels/test.tsv file, but this id is not present in the corpus.jsonl file.

The pytrec_eval tool used for evaluation checks whether each query id is present and logs a message if it is not, but it does no such check for passage ids. So I think this qrel line is treated as valid but can never be satisfied during evaluation, since the passage id is not in the corpus. Is this expected behavior, or should the line be filtered out of the original BEIR dataset?
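
The mismatch can be checked with a few lines of Python. This is only a minimal sketch, assuming the usual BEIR layout: corpus.jsonl holds one JSON object per line with an "_id" field, and qrels/test.tsv is tab-separated with a header row of query-id, corpus-id, score. The function name find_dangling_qrels and the ids in the usage example are made up for illustration.

```python
import csv
import json


def find_dangling_qrels(corpus_lines, qrels_lines):
    """Return (query_id, doc_id) pairs whose doc_id is missing from the corpus.

    Both arguments are iterables of text lines, e.g. open file handles.
    Assumed field names: "_id" in the corpus, "query-id" and "corpus-id"
    in the qrels header.
    """
    # Collect every document id that actually exists in the corpus.
    corpus_ids = set()
    for line in corpus_lines:
        line = line.strip()
        if line:
            corpus_ids.add(json.loads(line)["_id"])

    # Walk the qrels and keep the judgments that point at unknown documents.
    reader = csv.DictReader(qrels_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    dangling = []
    for row in reader:
        if row["corpus-id"] not in corpus_ids:
            dangling.append((row["query-id"], row["corpus-id"]))
    return dangling


if __name__ == "__main__":
    # Tiny in-memory example with invented ids; with the real dataset, pass
    # open("corpus.jsonl") and open("qrels/test.tsv") instead.
    corpus = ['{"_id": "doc-a", "title": "", "text": "some passage"}']
    qrels = ["query-id\tcorpus-id\tscore", "q1\tdoc-a\t1", "q2\tdoc-missing\t1"]
    print(find_dangling_qrels(corpus, qrels))
```

Any pair it returns is a relevance judgment that no retrieved document can ever match, which is the situation described above.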

Thanks, Remi