drogozhang / LED

Source code of paper 'LED: Lexicon-Enlightened Dense Retriever for Large-Scale Retrieval' (WWW 2023)

How to train the model with custom datasets #1

Closed liulizuel closed 1 year ago

liulizuel commented 1 year ago

Thanks for your contribution and great work! I have some questions: how do I train the model with custom datasets? I am going to train the whole model on Chinese retrieval datasets. All I need to do is:

  1. transform my training data into the customized data format
  2. initialize the backbone BERT-like model with a Chinese BERT-like model such as bert-base-chinese (sketched below)
  3. train step by step

Am I right? And what is the format for the customized training data?
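
For step 2, I assume swapping the backbone is just a matter of pointing to a Chinese checkpoint, something like this minimal sketch with Hugging Face transformers (using bert-base-chinese as an example):

```python
# Minimal sketch (my assumption) for step 2: loading a Chinese BERT backbone
# via Hugging Face transformers instead of the English default.
from transformers import AutoTokenizer, AutoModel

model_name = "bert-base-chinese"  # any Chinese BERT-like checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Quick sanity check that Chinese text tokenizes and encodes as expected.
inputs = tokenizer("这是一个中文检索查询示例", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768)
```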

drogozhang commented 1 year ago

Hi, thanks for your interest in our work. You are 100% correct. For the data format, you can follow the layout below:

$DATA_DIR/
--msmarco/
----collection.tsv
----collection.tsv.title.tsv (titles, copied from https://github.com/texttron/tevatron)
----passage_ranking/
------train.query.txt [502939 lines]
------qrels.train.tsv [532761 lines] 
------train.negatives.tsv [400782 lines] (BM25 negatives, copied from tevatron)
------dev.query.txt [6980 lines]
------qrels.dev.tsv [7437 lines] 
------top1000.dev [6668967 lines] 
------test2019.query.txt [200 lines]  
------qrels.test2019.tsv [9260 lines] 
------top1000.test2019 [189877 lines] 

You can download these from https://microsoft.github.io/msmarco/Datasets, and then rename them accordingly.
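
If you are building these files from your own Chinese data instead of downloading MS MARCO, a rough sketch like the one below should get the layout right. The column conventions assumed here are the usual MS MARCO / Tevatron ones (tab-separated ids and text, comma-separated negative pids), so please double-check them against the downloaded files:

```python
# Rough sketch of writing custom training files in the expected layout.
# Assumed column conventions (MS MARCO / Tevatron style):
#   collection.tsv      -> pid \t passage_text
#   train.query.txt     -> qid \t query_text
#   qrels.train.tsv     -> qid \t 0 \t pid \t 1
#   train.negatives.tsv -> qid \t neg_pid1,neg_pid2,...
import csv

def write_tsv(path, rows):
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(rows)

# Toy example with two Chinese passages and one query.
passages = {"0": "北京是中国的首都。", "1": "上海是中国最大的城市之一。"}
queries = {"100": "中国的首都是哪里"}
positives = {"100": ["0"]}   # qid -> relevant pids
negatives = {"100": ["1"]}   # qid -> hard-negative pids (e.g. from BM25)

write_tsv("collection.tsv", passages.items())
write_tsv("train.query.txt", queries.items())
write_tsv("qrels.train.tsv", [(q, 0, p, 1) for q, pids in positives.items() for p in pids])
write_tsv("train.negatives.tsv", [(q, ",".join(pids)) for q, pids in negatives.items()])
```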

Hope it helps!

Best, Kai

liulizuel commented 1 year ago

Thank you very much~

drogozhang commented 1 year ago

Question answered.