beir-cellar / beir


Reproducibility for all-MiniLM-L6-v2 #129

Open Manikseam opened 1 year ago

Manikseam commented 1 year ago

Hello, I had a look at the results comparing the GPT-3, Google, and Sentence Transformers embedding models against each other on a few benchmarks from BEIR (https://docs.google.com/spreadsheets/d/1cbkJAinXVKIzf6oZ1lAR_feD4rw1vXFSjCjb0AJ83jI/edit#gid=0). I'm wondering how the results for the Sentence Transformers models could be reproduced.

I used the SBERT evaluation script with all-MiniLM-L6-v2 and ended up with a score of 0.18 for SciDocs and around 0.45 on average for CQADupStack. The leaderboard also shows results in the range of 0.14 to 0.21 for SciDocs, so both results are far from the scores mentioned in the spreadsheet above.
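For reference, here is roughly what I ran: a minimal sketch following the repo's dense-retrieval SBERT example, assuming the standard BEIR dataset download location (the batch size is my own choice, not necessarily what the leaderboard used):

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download and load the SciDocs test split from the standard BEIR dataset host.
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scidocs.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Encode queries and passages with all-MiniLM-L6-v2; exact search by cosine similarity.
model = DRES(models.SentenceBERT("sentence-transformers/all-MiniLM-L6-v2"), batch_size=128)
retriever = EvaluateRetrieval(model, score_function="cos_sim")

results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)  # nDCG@10 is the metric reported on the BEIR leaderboard
```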

Can someone help me and point me in the right direction?

thakur-nandan commented 1 year ago

Hi @Manikseam, all-MiniLM-L6-v2 is a sentence-level model, and it suffers on the retrieval tasks in BEIR, which require passage-level understanding to retrieve relevant passages for a given query.

The spreadsheet you shared contains only sentence-level tasks, where all-MiniLM-L6-v2 performs well.

I would suggest having a look at passage-level models such as TAS-B to reproduce the scores reported on the BEIR leaderboard.
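Swapping the model into the same evaluation script is a small change. As a sketch (assuming the sentence-transformers/msmarco-distilbert-base-tas-b checkpoint on the Hugging Face hub), note that TAS-B was trained with dot-product scoring rather than cosine similarity:

```python
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# TAS-B is a passage-level dense retriever distilled on MS MARCO.
# It was trained with dot-product similarity, so use "dot", not "cos_sim".
model = DRES(models.SentenceBERT("sentence-transformers/msmarco-distilbert-base-tas-b"),
             batch_size=128)
retriever = EvaluateRetrieval(model, score_function="dot")
```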

Kind Regards, Nandan Thakur