allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/

Unified getter for the relevance level #254

Open TheMrSheldon opened 8 months ago

TheMrSheldon commented 8 months ago

Is your feature request related to a problem? Please describe. ir_datasets centralizes a lot of information about datasets. However, when using evaluation measures with binary relevance (like MAP, MRR, ...), one needs to look up the correct relevance level for the dataset, which is easy to miss. Is it correct that ir_datasets currently does not track the minimum relevance level?

Describe the solution you'd like Would it be possible to add a function dataset.get_relevance_level() -> int that returns the minimum relevance level for the dataset (e.g., 1 for TREC DL '19 doc and 2 for TREC DL '19 passage)? Some datasets (e.g., ANTIQUE) also recommend a remapping of the graded relevance labels. Could this be performed automatically? For example, during the download of ANTIQUE the qrels would be remapped from the 1-4 range to 0-3, and the relevance level for ANTIQUE would then be returned as 2 (the standard relevance level of 3, also reduced by 1).
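A hypothetical sketch of what such a getter could look like (the method name get_relevance_level and its placement on the dataset object are taken from this request and do not exist in ir_datasets today):

```python
import ir_datasets

# Hypothetical getter sketched from this request; not part of ir_datasets today.
dl19_doc = ir_datasets.load("msmarco-document/trec-dl-2019")
dl19_pass = ir_datasets.load("msmarco-passage/trec-dl-2019")

dl19_doc.get_relevance_level()   # would return 1 (hypothetical)
dl19_pass.get_relevance_level()  # would return 2 (hypothetical)
```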

Describe alternatives you've considered To my knowledge, this currently has to be done manually.

Additional context Such a function could then be used in conjunction with pyterrier or pytrec_eval so that the user does not need to manually look up and hardcode the relevance_level for every dataset they use. This would greatly reduce the risk of incomparable evaluation results when some people forget to set the correct relevance_level and others don't.
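To illustrate, roughly what this could look like with pytrec_eval (a sketch: get_relevance_level() is the hypothetical method from above, and it assumes a pytrec_eval version that accepts the relevance_level argument):

```python
import ir_datasets
import pytrec_eval

dataset = ir_datasets.load("msmarco-passage/trec-dl-2019/judged")

# Build the qrels dict that pytrec_eval expects.
qrels = {}
for qrel in dataset.qrels_iter():
    qrels.setdefault(qrel.query_id, {})[qrel.doc_id] = qrel.relevance

run = ...  # {query_id: {doc_id: score}} produced by your retrieval system

evaluator = pytrec_eval.RelevanceEvaluator(
    qrels,
    {"map", "recip_rank"},
    relevance_level=dataset.get_relevance_level(),  # hypothetical; today this value (2 here) is hardcoded
)
results = evaluator.evaluate(run)
```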

seanmacavaney commented 8 months ago

This sounds like a good addition, and I'm in favor of adding a dataset.qrels.binary_relevance_cutoff() function (or similar), especially considering how frequently this causes folks problems.

The current solution is to provide this information as the "official measures". The sister project, ir-measures, specifies the minimum relevance threshold directly in the measure's name and passes it down when invoking pytrec_eval. See an example here: https://ir-datasets.com/msmarco-passage#msmarco-passage/trec-dl-2019

[Screenshot: the official measures listed for msmarco-passage/trec-dl-2019, with the relevance cutoff encoded in the measure names]
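In code, this looks roughly like the sketch below with ir-measures: the cutoff is part of the measure itself, so it gets applied automatically at evaluation time (run stands in for your system's output):

```python
import ir_datasets
import ir_measures
from ir_measures import AP, RR, nDCG

dataset = ir_datasets.load("msmarco-passage/trec-dl-2019/judged")
run = ...  # {query_id: {doc_id: score}} or any run format ir_measures accepts

# The rel=2 cutoff is encoded in the measure names, so nobody has to
# remember to pass a separate relevance_level to the evaluator.
ir_measures.calc_aggregate([AP(rel=2), RR(rel=2), nDCG @ 10], dataset.qrels_iter(), run)
```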

However, the official measure documentation isn't very complete (e.g., it's not documented for ANTIQUE), and in some cases, datasets don't have measure(s) that can be clearly marked as official.

I'm far more hesitant to perform any mapping of the data directly. From a software design perspective, this seems like the job of the evaluator, not the data provider. This would also be a breaking change for anybody who is already using an unmapped version of the qrels.
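If someone does want the ANTIQUE remapping, it's only a couple of lines on the user's side (a sketch using the public qrels_iter() API, shifting the 1-4 labels mentioned above down to 0-3 at evaluation time, without touching the distributed qrels):

```python
import ir_datasets

dataset = ir_datasets.load("antique/test")

# Shift ANTIQUE's 1-4 labels down to 0-3 on the evaluation side only.
qrels = {}
for qrel in dataset.qrels_iter():
    qrels.setdefault(qrel.query_id, {})[qrel.doc_id] = qrel.relevance - 1
```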