Closed by huvunvidia 3 years ago
Hi @huvunvidia, sorry for the long delay. retriever_data and biencoder_data are utility classes for working with the corresponding datasets at runtime, during training or inference. They are not standalone scripts, and they were not used to prepare the data. We don't provide the code for data preparation or for getting BM25 results, to avoid dependencies on Java and other libraries. We largely followed the data preparation process and scripts from the DrQA codebase, plus WikiExtractor.
@vlad-karpukhin thank you very much for your response. Now that I understand the pre-processing followed the DrQA codebase plus WikiExtractor, how about BM25? Can you tell me which library or software you used to run BM25? Thank you very much.
Hi @huvunvidia , We just used Lucene with the default tf-idf scoring settings. Better BM25 results can be achieved using a tuned version from Pyserini: https://github.com/castorini/pyserini/blob/master/docs/experiments-dpr.md
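For readers unfamiliar with what a BM25 retrieval step computes, here is a minimal pure-Python sketch of the standard Okapi BM25 scoring formula. It is not the code used to build the DPR training data (that was Lucene, as noted above) and it is not Pyserini's API. The function name `bm25_scores`, the whitespace tokenizer, and the `k1`/`b` values are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25.

    Hypothetical helper for illustration: lowercased whitespace
    tokenization, and k1/b set to commonly used values (assumption).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    # Average document length; fall back to 1.0 if every document is empty.
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs or 1.0

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    query_terms = query.lower().split()
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            denom = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores


# Toy passages standing in for Wikipedia chunks.
passages = [
    "the cat sat on the mat",
    "dogs chase cats",
    "the quick brown fox",
]
scores = bm25_scores("cat mat", passages)
# Rank passage indices from highest to lowest score.
ranking = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
```

In a real pipeline the top-ranked passages that do not contain the answer would typically serve as hard negatives; at Wikipedia scale an inverted index such as Lucene or Pyserini is needed rather than this linear scan.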
Thank you for the code. I wonder if you can share more detailed instructions on how to run "retriever_data.py" and "biencoder_data.py" to correctly reproduce the processed data for training DPR. I find many methods implemented in these two files, but I don't know where to start or which arguments the methods expect. Also, does the code provide the BM25 retrieval steps for the initial DPR training set? If so, can you share it? Thank you very much.