beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
http://beir.ai
Apache License 2.0

Marco triplet in train_msmarco_v2.py #82

Closed: WenzhengZhang closed 2 years ago

WenzhengZhang commented 2 years ago

Thanks for sharing your work! I found that the triplets_url in train_msmarco_v2.py is stale. Could you please share the MS MARCO triplets file? Thanks!

nreimers commented 2 years ago

@WenzhengZhang You can find an updated version of the file here: https://huggingface.co/datasets/sentence-transformers/embedding-training-data/resolve/main/msmarco-triplets.jsonl.gz

Note that this file is in JSONL format, while the linked script expects TSV. Hence, you need to update how your script loads the triplets.
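For anyone adapting the loader, here is a minimal sketch of reading a gzip-compressed JSON Lines file instead of a TSV. The record layout is an assumption, not something confirmed in this thread: the code accepts either a JSON array `[query, positive, negative]` or an object with `query`, `pos`, and `neg` keys, so inspect the first line of the downloaded file before relying on it. The helper names `expand_record` and `iter_triplets` are hypothetical and are not part of BEIR or sentence-transformers.

```python
import gzip
import json
from typing import Iterator, List, Tuple

Triplet = Tuple[str, str, str]


def expand_record(record) -> List[Triplet]:
    """Turn one decoded JSON record into (query, positive, negative) triplets."""
    if isinstance(record, dict):
        # Assumed layout: {"query": str, "pos": [str, ...], "neg": [str, ...]}
        query = record["query"]
        # Pair every positive with every negative for this query.
        return [(query, pos, neg) for pos in record["pos"] for neg in record["neg"]]
    # Assumed layout: [query, positive, negative]
    query, pos, neg = record
    return [(query, pos, neg)]


def iter_triplets(path: str) -> Iterator[Triplet]:
    """Stream triplets from a gzip-compressed JSON Lines file, one record per line."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            yield from expand_record(json.loads(line))
```

Each yielded tuple holds the raw texts, so it can be wrapped in whatever example type the training script builds from its TSV rows. If records carry many positives and negatives, pairing all of them may produce far more triplets than needed; sampling one negative per positive would be a reasonable alternative.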

Better (updated) training scripts can be found here: https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/ms_marco

WenzhengZhang commented 2 years ago

Thanks for your help!