Steps for reproduce processed data for training

facebookresearch / DPR

Dense Passage Retriever (DPR) is a set of tools and models for the open-domain Q&A task.

Steps for reproduce processed data for training #168

Closed huvunvidia closed 3 years ago

huvunvidia commented 3 years ago

Thank you for the code. Could you share more detailed instructions on how to run "retriever_data.py" and "biencoder_data.py" to correctly reproduce the processed data for training DPR? I see many methods implemented in these two files, but I don't know where to start or which arguments the methods take. Also, does the code include the BM25 retrieval steps used for the initial DPR training set? If so, can you share it? Thank you very much.

vlad-karpukhin commented 3 years ago

Hi @huvunvidia, sorry for the long delay. retriever_data and biencoder_data are utility classes for working with the corresponding datasets at runtime, during training or inference. They are not standalone scripts, and they were not used to prepare the data. We don't provide the code for data preparation or for getting BM25 results, to avoid dependencies on Java and other libraries. We largely followed the data preparation process and scripts from the DrQA codebase, together with WikiExtractor.

huvunvidia commented 3 years ago

@vlad-karpukhin thank you very much for your response. Now that I understand the pre-processing code followed the DrQA codebase + WikiExtractor, how about BM25? Can you tell me which library or software you used to run BM25? Thank you very much.

vlad-karpukhin commented 3 years ago

Hi @huvunvidia, we just used Lucene with the default tf-idf scoring settings. Better BM25 results can be achieved using a tuned version from Pyserini: https://github.com/castorini/pyserini/blob/master/docs/experiments-dpr.md
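
For readers who want to see what BM25 ranking does before setting up Lucene or Pyserini, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is not the code used to build the DPR training data, and it will not reproduce Lucene's or Pyserini's numbers. The `BM25` class, the whitespace `tokenize` helper, and the `k1=1.2`, `b=0.75` defaults are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase + whitespace split.
    # Lucene analyzers do much more (stemming, stop words, punctuation).
    return text.lower().split()


class BM25:
    """Illustrative Okapi BM25 scorer over a small in-memory passage list."""

    def __init__(self, passages, k1=1.2, b=0.75):
        # k1 and b are assumed textbook values, not tuned settings.
        self.k1 = k1
        self.b = b
        self.term_freqs = [Counter(tokenize(p)) for p in passages]
        self.lengths = [sum(tf.values()) for tf in self.term_freqs]
        self.avg_len = sum(self.lengths) / len(passages)

        # Document frequency: number of passages containing each term.
        doc_freq = Counter()
        for tf in self.term_freqs:
            doc_freq.update(tf.keys())

        n = len(passages)
        # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
        self.idf = {
            term: math.log(1 + (n - df + 0.5) / (df + 0.5))
            for term, df in doc_freq.items()
        }

    def score(self, question, idx):
        tf = self.term_freqs[idx]
        # Length normalization term shared by all query terms.
        norm = self.k1 * (1 - self.b + self.b * self.lengths[idx] / self.avg_len)
        total = 0.0
        for term in tokenize(question):
            freq = tf.get(term, 0)
            if freq:
                total += self.idf[term] * freq * (self.k1 + 1) / (freq + norm)
        return total

    def search(self, question, k=5):
        # Return up to k (passage_index, score) pairs, best first,
        # skipping passages that share no term with the question.
        scored = [(i, self.score(question, i)) for i in range(len(self.lengths))]
        scored = [(i, s) for i, s in scored if s > 0]
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        return scored[:k]


if __name__ == "__main__":
    passages = [
        "Paris is the capital of France",
        "Berlin is the capital of Germany",
        "The Eiffel Tower is in Paris France",
    ]
    bm25 = BM25(passages)
    for idx, score in bm25.search("capital of France", k=3):
        print(idx, round(score, 3), passages[idx])
```

For the full Wikipedia passage collection, an inverted index such as the one Lucene or Pyserini builds is the practical route; scanning every passage per question, as this sketch does, only works for toy corpora.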