beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use: evaluate your models across 15+ diverse IR datasets.
http://beir.ai
Apache License 2.0

Script to generate leaderboard metrics #65

Closed: thigm85 closed this issue 2 years ago

thigm85 commented 2 years ago

Can I find the complete script used to generate this leaderboard somewhere? I saw snippets such as benchmark_bm25.py, but not a full-scale script that includes the Elasticsearch configuration and everything else.

I am implementing a BEIR-compatible Vespa version that I plan to submit as a PR soon. However, my BM25 metrics differ from the Elasticsearch BM25 results on the leaderboard.

Generating results side by side would be a great way to debug my implementation.
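For example, something along these lines, using BEIR's evaluator on toy data (the qrels and result dicts below are placeholders standing in for a real Elasticsearch baseline run and my Vespa run):

```python
from beir.retrieval.evaluation import EvaluateRetrieval

# Toy relevance judgements plus two result sets standing in for the
# Elasticsearch BM25 baseline and my Vespa run; both map
# query_id -> {doc_id: score}.
qrels = {"q1": {"d1": 1, "d2": 0}}
results_es = {"q1": {"d1": 1.9, "d2": 1.2}}
results_vespa = {"q1": {"d2": 1.7, "d1": 1.1}}

evaluator = EvaluateRetrieval()
for name, results in [("elasticsearch", results_es), ("vespa", results_vespa)]:
    # evaluate() reports nDCG@k, MAP@k, Recall@k, and P@k via pytrec_eval.
    ndcg, _map, recall, precision = evaluator.evaluate(qrels, results, [1, 10])
    print(name, ndcg, recall)
```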

thakur-nandan commented 2 years ago

Hi @thigm85,

I used this Python script, evaluate_bm25.py, to generate the Elasticsearch BM25 scores for the leaderboard.

However, as noted recently in #58, some metrics might differ from what I originally reported on the leaderboard. I will rerun them soon and update the leaderboard with the latest metrics. Since these changes are not yet reflected in the latest pip version, you can check out the development branch locally and use evaluate_bm25.py to get accurate Elasticsearch BM25 scores.
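For reference, the core of that script boils down to something like this (a condensed sketch of evaluate_bm25.py; the exact dataset URL, defaults, and logging in the repository version may differ):

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.lexical import BM25Search as BM25

# Download and load one BEIR dataset (scifact as an example).
dataset = "scifact"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

# Elasticsearch configuration: a local instance with one index per
# dataset; initialize=True wipes and recreates the index before indexing.
model = BM25(index_name=dataset, hostname="localhost", initialize=True)
retriever = EvaluateRetrieval(model)

# Retrieve with BM25 and score the run against the qrels.
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
```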

Kind Regards, Nandan Thakur

thigm85 commented 2 years ago

Thanks for the reply

ziqing-huang commented 2 years ago

Hi @NThakur20,

I want to confirm whether you have updated the metrics; I am unable to reproduce them with the latest code.
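In case it helps with debugging, this is how I am checking which copy of the package I am actually importing (assuming the installed package exposes __version__, as recent releases do):

```python
import beir

# Make sure a pip release and a local development checkout are not
# being mixed up: print the imported version and its on-disk location.
print(beir.__version__, beir.__file__)
```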