Different version of marco training set

beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.

http://beir.ai

Apache License 2.0


Different version of marco training set #110

jzhoubu closed this issue 1 year ago

jzhoubu commented 1 year ago

Hi @thakur-nandan, thanks for the great work!

I am trying to reproduce the results of some of the dense retrievers on the BEIR leaderboard, and I found two versions (v2 and v3) of the processed MS MARCO dataset here. Which one was used to train the dense retrievers? Thanks!

jzhoubu commented 1 year ago

Following the instructions, I constructed both the v2 and v3 datasets. Both have 498970 unique queries and 9144553 (query, positive_passage, negative_passage) triples, so they appear to be the same dataset stored in different data structures. I will close this issue.
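
For anyone who wants to repeat this kind of check, here is a minimal sketch of how two differently structured copies of a triple dataset could be compared. The two layouts (a flat list of triples, and a dict grouping passages per query), the toy data, and the helper names `summarize` and `expand_grouped` are all hypothetical and only for illustration; they are not the actual v2/v3 file formats.

```python
from itertools import product


def summarize(triples):
    """Return (number of unique queries, number of unique triples)."""
    unique_triples = set(triples)
    unique_queries = {query for query, _, _ in unique_triples}
    return len(unique_queries), len(unique_triples)


def expand_grouped(grouped):
    """Expand a hypothetical grouped layout into flat triples.

    Assumed layout: {query: {"pos": [...], "neg": [...]}}.
    Every positive is paired with every negative of the same query.
    """
    for query, passages in grouped.items():
        for pos, neg in product(passages["pos"], passages["neg"]):
            yield (query, pos, neg)


# Toy data standing in for the two versions (not real MS MARCO content).
flat = [
    ("q1", "p1", "n1"),
    ("q1", "p1", "n2"),
    ("q2", "p2", "n3"),
]
grouped = {
    "q1": {"pos": ["p1"], "neg": ["n1", "n2"]},
    "q2": {"pos": ["p2"], "neg": ["n3"]},
}

flat_stats = summarize(flat)
grouped_stats = summarize(expand_grouped(grouped))

# Matching counts are only a hint; comparing the sets confirms identical content.
same_content = set(flat) == set(expand_grouped(grouped))

print(flat_stats, grouped_stats, same_content)
```

Equal query and triple counts suggest the two versions match, but only the set comparison shows that they contain exactly the same triples.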