beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
http://beir.ai
Apache License 2.0

Custom metric implementation depends on length of qrels #102

Closed · iknoorjobs closed this 2 years ago

iknoorjobs commented 2 years ago

Hi @thakur-nandan, great work! Keep it up! 🍻 I just noticed one thing in the code for the custom metric (here): the sum of the MRR@k scores is divided by `len(qrels)`. However, it should actually be divided by the number of queries for which MRR is computed (in your code, that would be `len(results)`). Please let me know if I'm wrong; if not, I'm happy to create a PR :)
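For concreteness, here is a rough sketch of the computation I mean (toy data structures and a simplified helper, not BEIR's actual code); the final division is the line in question:

```python
# Sketch of MRR@k over BEIR-style dictionaries (illustration only).
# qrels:   query_id -> {doc_id: relevance judgement}
# results: query_id -> {doc_id: retrieval score}

def mrr_at_k(qrels: dict, results: dict, k: int = 10) -> float:
    total = 0.0
    for query_id, doc_scores in results.items():
        # Rank the retrieved documents by descending score, keep the top k.
        ranked = sorted(doc_scores, key=doc_scores.get, reverse=True)[:k]
        for rank, doc_id in enumerate(ranked, start=1):
            if qrels.get(query_id, {}).get(doc_id, 0) > 0:
                total += 1.0 / rank  # reciprocal rank of first relevant hit
                break
    # The line in question: divide by len(qrels) (all ground-truth queries)
    # or by len(results) (only the queries that returned any documents)?
    return total / len(qrels)
```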

Best, Iknoor Singh

thakur-nandan commented 2 years ago

Hi @iknoorjobs,

To my understanding, the evaluation should be based on the length of `qrels`.

For example, in lexical search with Elasticsearch, a few queries may fail to return any top-k documents at all. In such cases, the results dictionary will not include those queries. We must make sure the retriever is penalized for them by effectively assigning an MRR@k score of 0 to every query that is present in `qrels` but missing from `results`.

We always treat `qrels` as the ground truth and evaluate against it.
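Plugging toy numbers into the sketch above makes this concrete (made-up query and document IDs):

```python
# Three queries in the ground truth, but the retriever only returns
# documents for two of them.
qrels = {
    "q1": {"d1": 1},
    "q2": {"d5": 1},
    "q3": {"d9": 1},  # no documents retrieved for q3
}
results = {
    "q1": {"d1": 0.9, "d2": 0.5},  # relevant doc at rank 1 -> RR = 1.0
    "q2": {"d7": 0.8, "d5": 0.6},  # relevant doc at rank 2 -> RR = 0.5
}
# Sum of reciprocal ranks = 1.5
# total / len(results) = 1.5 / 2 = 0.75  -> hides the failure on q3
# total / len(qrels)   = 1.5 / 3 = 0.50  -> q3 implicitly contributes 0
```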

Kind Regards, Nandan Thakur

iknoorjobs commented 2 years ago

Hi @thakur-nandan, thanks for your response. Yes, that makes sense. In my case, I'm using a custom dataset for evaluation, and my qrels combine the test and train set queries; that's probably what's causing the wrong results.
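One way to fix this on my end would be to restrict qrels to the split actually being evaluated before computing the metric. A rough sketch, assuming `test_query_ids` holds the IDs of the test queries (a placeholder variable, not part of BEIR's API):

```python
# Keep only the qrels entries for the split being evaluated, so that
# len(qrels) matches the set of queries the retriever actually ran on.
test_qrels = {qid: rels for qid, rels in qrels.items() if qid in test_query_ids}
```

Regards, Iknoor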