Closed liulizuel closed 1 year ago
Hi, thanks for your interest in our work. You are 100% correct. For the data format, you can follow the layout below:
$DATA_DIR/
--msmarco/
----collection.tsv
----collection.tsv.title.tsv (titles, copied from https://github.com/texttron/tevatron)
----passage_ranking/
------train.query.txt [502939 lines]
------qrels.train.tsv [532761 lines]
------train.negatives.tsv [400782 lines] (BM25 negatives, copied from tevatron)
------dev.query.txt [6980 lines]
------qrels.dev.tsv [7437 lines]
------top1000.dev [6668967 lines]
------test2019.query.txt [200 lines]
------qrels.test2019.tsv [9260 lines]
------top1000.test2019 [189877 lines]
You can download these files from https://microsoft.github.io/msmarco/Datasets and then rename them to match the layout above.
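As a quick sanity check after downloading and renaming, something like the following sketch could confirm that every file in the listing exists and has the stated number of lines. The relative paths and line counts are copied from the listing. The names `EXPECTED`, `count_lines`, and `check_layout` are hypothetical helpers written for this illustration and are not part of the repository. Counts may differ slightly if a downloaded file was produced or trimmed differently.

```python
from pathlib import Path

# Relative paths and line counts copied from the directory listing above.
# None means the listing gives no line count for that file.
EXPECTED = {
    "msmarco/collection.tsv": None,
    "msmarco/collection.tsv.title.tsv": None,
    "msmarco/passage_ranking/train.query.txt": 502939,
    "msmarco/passage_ranking/qrels.train.tsv": 532761,
    "msmarco/passage_ranking/train.negatives.tsv": 400782,
    "msmarco/passage_ranking/dev.query.txt": 6980,
    "msmarco/passage_ranking/qrels.dev.tsv": 7437,
    "msmarco/passage_ranking/top1000.dev": 6668967,
    "msmarco/passage_ranking/test2019.query.txt": 200,
    "msmarco/passage_ranking/qrels.test2019.tsv": 9260,
    "msmarco/passage_ranking/top1000.test2019": 189877,
}


def count_lines(path):
    """Count lines by streaming the file in binary mode (no decoding needed)."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def check_layout(data_dir, expected=EXPECTED):
    """Return a list of problems; an empty list means the layout matches."""
    problems = []
    root = Path(data_dir)
    for rel, n_expected in expected.items():
        path = root / rel
        if not path.is_file():
            problems.append(f"missing: {rel}")
            continue
        if n_expected is not None:
            n = count_lines(path)
            if n != n_expected:
                problems.append(f"{rel}: {n} lines, expected {n_expected}")
    return problems
```

Calling `check_layout("/path/to/DATA_DIR")` returns an empty list when everything matches, or one message per missing or mismatched file otherwise.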
Hope it helps!
Best, Kai
Thank you very much~
Question answered.
Thanks for your contribution and great work! I have some questions. How do I train the model with custom datasets? I am going to train the whole model on Chinese retrieval datasets. All I need to do:
Am I right? And what is the format for the custom training data?