We used the WikiExtractor library and the DrQA library's Wikipedia processing methods to extract text from Wikipedia pages. We used the default WikiExtractor settings, which skip lists, tables, and infoboxes and retrieve only paragraph text. Then we concatenated all Wikipedia page text and split it into 100-word passages without using any fancy tokenizers, just spaces as word boundaries.
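For illustration, here is a minimal sketch of that splitting step (the function name and inputs are hypothetical, not the actual DPR preprocessing code):

```python
def split_into_passages(page_texts, passage_len=100):
    """Concatenate extracted Wikipedia page texts and yield 100-word passages.

    Words are delimited by whitespace only -- no tokenizer is applied,
    matching the description above.
    """
    words = " ".join(page_texts).split()
    for start in range(0, len(words), passage_len):
        yield " ".join(words[start:start + passage_len])

# Hypothetical usage:
# passages = list(split_into_passages(extracted_page_texts))
```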
@vlad-karpukhin Thanks for your detailed reply. I am trying to reproduce the results on other datasets (TriviaQA/WebQuestions/CuratedTREC/SQuAD), and have some follow-up questions on this (e.g., should shuffle_positive_ctx be set to False for the TriviaQA and SQuAD datasets?). Thanks.
Hi @sysu-zjw, the datasets have been added. See https://github.com/facebookresearch/DPR/issues/143
Thanks, @vlad-karpukhin. I'll close this issue.
Hi, thanks for sharing. I recently found that some papers conduct experiments on the NQ dataset with a 12M-passage corpus rather than the 21M-passage corpus used here. Would you share more information on how the 21M corpus is constructed and what the major differences between these corpora might be? Thanks much.
Relevant info quoted from paper [1]:
[1] Progressively Pretrained Dense Corpus Index for Open-Domain Question Answering [link]
[2] Latent Retrieval for Weakly Supervised Open Domain Question Answering [link]