beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
http://beir.ai
Apache License 2.0

Strange NDCG@10 for Touche-2020 on the BEIR leaderboard #74

Open thigm85 opened 2 years ago

thigm85 commented 2 years ago

I noticed that the NDCG@10 for Touche-2020 on the BEIR leaderboard is around 0.60 for elastic bm25.

Is it correct to assume that Touche-2020 corresponds to the dataset named "webis-touche2020"? If so, I just ran Elasticsearch BM25 on it and got NDCG@10 of around 0.35, which is similar to what I got with Vespa.

Any thoughts?
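For reference, the metric both numbers refer to, NDCG@10, can be sketched in a few lines of plain Python. This is a minimal sketch using the standard linear-gain, log2-discount formulation (the trec_eval convention); the doc IDs below are hypothetical:

```python
import math

def ndcg_at_k(ranked_ids, qrels, k=10):
    """NDCG@k: DCG of the produced ranking divided by the ideal DCG.

    ranked_ids: doc IDs in retrieved order; qrels: doc ID -> graded relevance.
    """
    def dcg(gains):
        # log2 discount: the gain at 0-based rank i is divided by log2(i + 2)
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids]
    idcg = dcg(sorted(qrels.values(), reverse=True))
    return dcg(gains) / idcg if idcg > 0 else 0.0

# Toy example with hypothetical doc IDs: d1 is highly relevant (2), d2 relevant (1).
qrels = {"d1": 2, "d2": 1}
print(ndcg_at_k(["d2", "d1", "d9"], qrels))  # imperfect order: below 1.0
print(ndcg_at_k(["d1", "d2"], qrels))        # ideal order: 1.0
```

Since annotation changes between dataset versions alter the qrels, the same ranking can legitimately score very differently across versions, which is consistent with the 0.60 vs 0.35 gap discussed below.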

cadurosar commented 2 years ago

Recently I had the same problem. It comes from the fact that there are two versions of webis-touche2020: the newer one is the one used now (around 0.35), while the older one yields better scores (around 0.60). In issue #11 it seems the older version was kept, but then in issue #40 the new version seems to have taken over as the default, making some numbers on the benchmark obsolete (mostly the reranking page; sparse and dense seem to be using the new version).

thigm85 commented 2 years ago

Got it. Thanks for the reply @cadurosar.

thigm85 commented 2 years ago

Does the same issue happen with msmarco? I just ran Elasticsearch BM25 on msmarco and got NDCG@10 of around 0.45 instead of the 0.22 reported on the BEIR leaderboard. Is that correct @NThakur20?

thakur-nandan commented 2 years ago

Hi, @thigm85 and @cadurosar,

Yes, the webis-touche authors contacted us about problems in the v1 version of their dataset, so we kept the scores from the v2 version (which has no annotation errors). As @cadurosar mentioned, some scores on the leaderboard may not have been updated yet. The leaderboard is being revamped and will soon carry the latest scores.

Regarding MSMARCO, I think you evaluated the test set; that is probably why you get NDCG@10 of around 0.45. You should evaluate the dev set instead, where you should get the score reported on the leaderboard.
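The split difference can be made concrete with BEIR's data loader. A minimal sketch, assuming the beir package is installed and `data_path` points to an unzipped BEIR-format msmarco folder:

```python
def load_split(data_path: str, split: str = "dev"):
    """Load a BEIR-format dataset split.

    For msmarco, the leaderboard numbers use split="dev"; evaluating
    split="test" is what produces the ~0.45 NDCG@10 seen above.
    """
    # Lazy import so the function can be defined without beir installed;
    # calling it requires `pip install beir` (assumption: BEIR-format data on disk).
    from beir.datasets.data_loader import GenericDataLoader
    corpus, queries, qrels = GenericDataLoader(data_path).load(split=split)
    return corpus, queries, qrels
```

Everything downstream (BM25 retrieval, NDCG@10 computation) stays the same; only the queries and qrels change with the split.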

Kind Regards, Nandan Thakur

thigm85 commented 2 years ago

Hi @NThakur20, thanks for the clarification. Does the leaderboard use the dev set for all the datasets or only for MS MARCO?

thakur-nandan commented 2 years ago

The dev set is used only for MSMARCO; the rest use the test sets.