beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use: evaluate your models across 15+ diverse IR datasets.
http://beir.ai
Apache License 2.0

Can't reproduce BM25 baselines #143

zhiyuanpeng closed this issue 1 year ago

zhiyuanpeng commented 1 year ago

Hi

I installed Elasticsearch on Debian 10:

{
  "name" : "xx",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "iE5FrFezRoWh0J-bp1Zb5g",
  "version" : {
    "number" : "7.17.10",
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "fecd68e3150eda0c307ab9a9d7557f5d5fd71349",
    "build_date" : "2023-04-23T05:33:18.138275597Z",
    "build_snapshot" : false,
    "lucene_version" : "8.11.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

I ran BM25 on the BEIR MS MARCO dataset and obtained NDCG@10: 0.4769, which is much higher than the score of 0.228 in Table 2 of your paper. On SciFact, I get NDCG@10: 0.6906, which is also higher than the 0.665 in Table 2. Any suggestions? Thanks
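
(For reference, a minimal sketch of the BM25 evaluation loop from BEIR's examples, assuming Elasticsearch is running locally on the default port; the dataset and output directory are illustrative. For MS MARCO, the dev split should be loaded, as discussed further down the thread.)

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.lexical import BM25Search as BM25

# Download and unzip a BEIR dataset (SciFact shown here).
dataset = "scifact"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")

# Load the test split (for MS MARCO, use split="dev").
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

# Index the corpus into a local Elasticsearch instance and retrieve with BM25.
model = BM25(index_name=dataset, hostname="localhost", initialize=True)
retriever = EvaluateRetrieval(model)
results = retriever.retrieve(corpus, queries)

# NDCG@k, MAP@k, Recall@k, Precision@k at the default cutoffs.
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)
```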

zhiyuanpeng commented 1 year ago

I re-ran the notebook and got a very close NDCG@10 score of 0.69064 on SciFact. Both NDCG@10 scores I get are higher than the 0.665 reported in Table 2.

nreimers commented 1 year ago

For MS MARCO you have to run it on the dev split.

zhiyuanpeng commented 1 year ago

@nreimers

Thanks for your reply. On MS MARCO dev, I got a very close NDCG@10 score: 0.22747. I am confused about reporting results on the dev set instead of the test set. The SBERT training script evaluates the model on the dev set during training, so why report results on dev? BTW, could you clarify how to use your splits: which split is for training, which is for evaluation during training, and which file is for final testing? Thank you very much!

nreimers commented 1 year ago

MS MARCO doesn't have a test set. Here in BEIR the test split is TREC DL 2019. It is quite confusing.

People report MS MARCO results on the dev set. This dev set shouldn't be used for training/early stopping, etc.
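
(Concretely, under BEIR's conventions the split is selected when loading the data; a minimal sketch, assuming the datasets have already been downloaded to datasets/:)

```python
from beir.datasets.data_loader import GenericDataLoader

# MS MARCO numbers in BEIR are reported on the "dev" split;
# its "test" split corresponds to TREC DL 2019.
corpus, queries, qrels = GenericDataLoader("datasets/msmarco").load(split="dev")

# Most other BEIR datasets (SciFact, FEVER, ...) are evaluated on "test".
corpus, queries, qrels = GenericDataLoader("datasets/scifact").load(split="test")
```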

zhiyuanpeng commented 1 year ago

@nreimers Thank you for the clarification! BTW, I can't reproduce your BM25 NDCG@10 on FEVER. I ran your notebook and got NDCG@10 0.64938, which is much lower than the 0.753 reported in Table 2. On dev, the NDCG@10 is 0.66363, which is also much lower than 0.753.

Update: BEIR reports Anserini's BM25 numbers. I will run Anserini to reproduce them. Thanks.

lintool commented 1 year ago

To reproduce official BEIR scores, Pyserini is probably easier... https://github.com/castorini/pyserini/

Specifically, try: https://castorini.github.io/pyserini/2cr/beir.html

You'll be able to get the scores on the official BEIR leaderboard: https://eval.ai/web/challenges/challenge-page/1897/overview
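
(For example, a minimal Python sketch against Pyserini's prebuilt BEIR indexes; the index name follows the naming used on the 2CR page, and the query string is a placeholder:)

```python
from pyserini.search.lucene import LuceneSearcher

# Download and open the prebuilt flat BM25 index for FEVER
# (a "multifield" variant is also listed on the 2CR page).
searcher = LuceneSearcher.from_prebuilt_index("beir-v1.0.0-fever.flat")

hits = searcher.search("a FEVER claim goes here", k=10)
for hit in hits:
    print(f"{hit.docid:30} {hit.score:.4f}")
```

The 2CR page also lists the exact batch-retrieval and trec_eval commands used to produce the official numbers.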

cc/ @thakur-nandan

zhiyuanpeng commented 1 year ago

@lintool Thanks. My reproduced NDCG@10 on FEVER is much closer to your result of 0.6513. BEIR uses Anserini BM25, so I will run it to reproduce the results.