beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
http://beir.ai
Apache License 2.0

Eval script breaks for custom evaluations when query does not have any hits #70

Closed · narayanacharya6 closed 2 years ago

narayanacharya6 commented 2 years ago

Steps to reproduce:

  1. Setup:

    git clone git@github.com:UKPLab/beir.git
    cd beir
    conda create --name beir python=3.7
    conda activate beir
    pip install -e .
  2. Run script sample.py:

    
    """
    Sourced from examples/retrieval/evaluation/lexical/evaluate_bm25.py
    """

from beir import util from beir.datasets.data_loader import GenericDataLoader from beir.retrieval.evaluation import EvaluateRetrieval from beir.retrieval.search.lexical import BM25Search as BM25

import pathlib, os

dataset = "nfcorpus" url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset) out_dir = os.path.join(pathlib.Path(file).parent.absolute(), "datasets") data_path = util.download_and_unzip(url, out_dir)

corpus, queries, qrels = GenericDataLoader(data_path).load(split="test") hostname = "http://0.0.0.0:9200" index_name = "nfcorpus_bug" initialize = True

number_of_shards = 1 model = BM25(index_name=index_name, hostname=hostname, initialize=initialize, number_of_shards=number_of_shards) retriever = EvaluateRetrieval(model) results = retriever.retrieve(corpus, queries)

for metric in ["mrr", "recall_cap", "hole", "accuracy"]: retriever.evaluate_custom(qrels, results, retriever.k_values, metric=metric)


Stack trace:

    Traceback (most recent call last):
      File "sample.py", line 78, in <module>
        retriever.evaluate_custom(qrels, results, retriever.k_values, metric=metric)
      File "/Users/narayan/OSS/beir/beir/retrieval/evaluation.py", line 92, in evaluate_custom
        return mrr(qrels, results, k_values)
      File "/Users/narayan/OSS/beir/beir/retrieval/custom_metrics.py", line 22, in mrr
        for rank, hit in enumerate(top_hits[query_id][0:k]):
    KeyError: 'PLAIN-510'



Notes:
I am running Elasticsearch v7.17.0 via Docker locally.

Preliminary Investigation:
I think that when a query gets no hits from ES, we should add an empty `scores` dict for it to `self.results` in the code below, so that the eval scripts do not raise a `KeyError` when a query is missing from `results`.

https://github.com/UKPLab/beir/blob/a55552db70f37102352fd5c21b4e811516659a55/beir/retrieval/search/lexical/bm25_search.py#L49-L54
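
For illustration, here is a rough sketch of what I have in mind (names like `es_hits_per_query` are my guesses for the surrounding code, not the actual implementation):

    # Hypothetical sketch of the proposed change in bm25_search.py
    # (variable names are assumptions, not the real ones): make sure every
    # query id ends up in self.results, even when Elasticsearch returns
    # no hits for it.
    for query_id, hits in es_hits_per_query.items():
        scores = {}                          # stays empty when there are no hits
        for corpus_id, score in hits:
            scores[corpus_id] = score
        self.results[query_id] = scores      # query id always present downstream
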
thakur-nandan commented 2 years ago

Hi @narayanacharya6,

Thanks for reporting this issue. Looking at the error in your stack trace, I believe what you describe is indeed happening: Elasticsearch may return no hits at all for a query, so that query never appears in the results. This case is handled by default when evaluating with pytrec_eval (Recall, Precision, NDCG, etc.), but not in my custom definition of MRR and possibly the other custom metrics. One could add zero scores to self.results, but a better solution would be to handle missing queries at the evaluation step of each metric. I will work on this and update the dev branch soon!
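
For illustration, the guard could look roughly like this inside the custom MRR computation (a sketch under assumed names, not the exact code in custom_metrics.py):

    # Hedged sketch of an MRR@k that tolerates queries missing from `results`
    # (not the actual BEIR implementation):
    def mrr(qrels, results, k_values):
        scores = {"MRR@{}".format(k): 0.0 for k in k_values}
        for k in k_values:
            for query_id, relevant_docs in qrels.items():
                # Query produced no hits -> contributes 0 instead of raising KeyError.
                hits = results.get(query_id, {})
                top_hits = sorted(hits.items(), key=lambda item: item[1], reverse=True)[:k]
                for rank, (doc_id, _) in enumerate(top_hits):
                    if relevant_docs.get(doc_id, 0) > 0:
                        scores["MRR@{}".format(k)] += 1.0 / (rank + 1)
                        break
            scores["MRR@{}".format(k)] /= max(len(qrels), 1)
        return scores

The same `results.get(query_id, {})` guard (or an explicit `continue`) would apply to the other custom metrics such as recall_cap, hole, and accuracy.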

Kind Regards,
Nandan Thakur