beir-cellar / beir


Reproducibility for all-MiniLM-L6-v2 #129

Open Manikseam opened 1 year ago

Manikseam commented 1 year ago

Hello, I had a look at the results comparing the GPT-3, Google, and Sentence Transformers embedding models against each other on a few benchmarks from BEIR (https://docs.google.com/spreadsheets/d/1cbkJAinXVKIzf6oZ1lAR_feD4rw1vXFSjCjb0AJ83jI/edit#gid=0). I'm wondering how the results for the Sentence Transformers models could be reproduced.

I used the SBERT evaluation script with all-MiniLM-L6-v2 and ended up with a score of 0.18 for SciDocs and around 0.45 on average for CQADupStack. The leaderboard also shows results in the range of 0.14 to 0.21 for SciDocs, so both results are far from the scores mentioned in the spreadsheet above.
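For reference, here is roughly what I ran: a minimal sketch following the repo's dense-retrieval SBERT example, assuming the standard BEIR dataset download location (the batch size is my own choice, not necessarily what the leaderboard used):

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download and load the SciDocs test split from the standard BEIR dataset host.
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scidocs.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Encode queries and passages with all-MiniLM-L6-v2; exact search by cosine similarity.
model = DRES(models.SentenceBERT("sentence-transformers/all-MiniLM-L6-v2"), batch_size=128)
retriever = EvaluateRetrieval(model, score_function="cos_sim")

results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)  # nDCG@10 is the metric reported on the BEIR leaderboard
```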

Can someone help me and point me in the right direction?

thakur-nandan commented 1 year ago

Hi @Manikseam, all-MiniLM-L6-v2 is a sentence-level model, and it suffers on the retrieval tasks in BEIR, which require passage-level understanding to retrieve relevant passages for a given query.

The spreadsheet you shared contains only sentence-level tasks, where all-MiniLM-L6-v2 performs well.

I would suggest having a look at passage-level models such as TAS-B to reproduce the scores reported on the BEIR leaderboard.
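Swapping the model into the same evaluation script is a small change. As a sketch (assuming the sentence-transformers/msmarco-distilbert-base-tas-b checkpoint on the Hugging Face hub), note that TAS-B was trained with dot-product scoring rather than cosine similarity:

```python
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# TAS-B is a passage-level dense retriever distilled on MS MARCO.
# It was trained with dot-product similarity, so use "dot", not "cos_sim".
model = DRES(models.SentenceBERT("sentence-transformers/msmarco-distilbert-base-tas-b"),
             batch_size=128)
retriever = EvaluateRetrieval(model, score_function="dot")
```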

Kind Regards, Nandan Thakur