facebookresearch / DPR

Dense Passage Retriever is a set of tools and models for the open-domain Q&A task.

Question on corpus size #138

Closed jzhoubu closed 3 years ago

jzhoubu commented 3 years ago

Hi, thanks for sharing. I recently found some papers that conduct experiments on the NQ dataset with a 12M corpus rather than the 21M one used here. Would you share more information on how the 21M corpus was constructed and what the major differences among these corpora might be? Thanks much.

Info quoted from paper [1]:

> we use the 12-20-2018 snapshot of English Wikipedia as our open-domain QA corpus. When splitting the documents into chunks, we try to reuse the original paragraph boundaries and create a new chunk every time the length of the current one exceeds 256 tokens. Overall, we created 12,494,770 text chunks, which is on-par with the number (13M) reported in previous work.
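For concreteness, here is my reading of that chunking rule as a rough sketch (not the paper's code; whitespace tokens stand in for their tokenizer):

```python
def chunk_document(paragraphs, max_tokens=256):
    """Greedily pack whole paragraphs into a chunk, starting a new
    chunk once the current one exceeds max_tokens tokens."""
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        current.append(para)
        current_len += len(para.split())  # whitespace "tokens" as a stand-in
        if current_len > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
    if current:
        chunks.append(" ".join(current))
    return chunks
```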

[1] Progressively Pretrained Dense Corpus Index for Open-Domain Question Answering [link]
[2] Latent Retrieval for Weakly Supervised Open Domain Question Answering [link]

vlad-karpukhin commented 3 years ago

We used the WikiExtractor lib and the DrQA library's Wikipedia processing methods to extract text from Wikipedia pages. We used the default WikiExtractor settings, which skip lists, tables, and infoboxes and retrieve only paragraph text. Then we concatenated all Wikipedia page text and split it into 100-word passages without using any fancy tokenizers, just spaces as word boundaries.
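In code, the splitting step amounts to something like this minimal sketch (an illustration of the description above, not the exact preprocessing script; it assumes the text has already been extracted with WikiExtractor/DrQA and concatenated):

```python
def split_into_passages(full_text, passage_len=100):
    """Split concatenated Wikipedia text into fixed-size passages,
    using spaces as the only word boundaries."""
    words = full_text.split()
    return [
        " ".join(words[i:i + passage_len])
        for i in range(0, len(words), passage_len)
    ]
```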

jzhoubu commented 3 years ago

> We used the WikiExtractor lib and the DrQA library's Wikipedia processing methods to extract text from Wikipedia pages. We used the default WikiExtractor settings, which skip lists, tables, and infoboxes and retrieve only paragraph text. Then we concatenated all Wikipedia page text and split it into 100-word passages without using any fancy tokenizers, just spaces as word boundaries.

@vlad-karpukhin Thanks for your detailed reply. I am trying to reproduce the results on the other datasets (TriviaQA/WebQuestions/CuratedTREC/SQuAD) and have some follow-up questions:

  1. I want to confirm whether the processed CuratedTREC and WebQuestions datasets will be made available later or not (as mentioned in issue #76 last year).
  2. According to the paper, DPR performs better on the NQ dataset without distant supervision. I wonder whether you have run experiments on other datasets to compare performance with vs. without distant supervision?
  3. Further, to reproduce TriviaQA and SQuAD, I want to confirm whether I need to train DPR under the distant-supervision setting (i.e., should I set shuffle_positive_ctx to False for the TriviaQA and SQuAD datasets?). My understanding of this flag is sketched after the list.
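For reference, here is how I currently understand the effect of that flag on positive selection, as a rough sketch (my own reading, not the actual DPR code; positive_ctxs is assumed to be ordered by retriever rank):

```python
import random

def pick_positive_ctx(positive_ctxs, shuffle_positive_ctx=False):
    # Under distant supervision, positives are retrieved passages that
    # contain the answer, ordered by retriever rank. With shuffling off,
    # training always sees the top-ranked positive; with shuffling on,
    # it samples uniformly among all candidate positives.
    if shuffle_positive_ctx:
        return random.choice(positive_ctxs)
    return positive_ctxs[0]
```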

Thanks.

vlad-karpukhin commented 3 years ago

Hi @sysu-zjw ,

  1. I do have the CuratedTREC & WebQuestions datasets, but they are not in the current DPR JSON format. The conversion is trivial; I'm just super busy now, but I can add those in a week or so - please feel free to remind me here if I don't provide updates.
  2. No, I only did that experiment on NQ since it has gold passages and is the most challenging across our Q&A task set.
  3. No, I haven't tried shuffle_positive_ctx=True for TriviaQA and SQuAD.

vlad-karpukhin commented 3 years ago

Hi @sysu-zjw, the datasets have been added. See https://github.com/facebookresearch/DPR/issues/143

jzhoubu commented 3 years ago

> Hi @sysu-zjw, the datasets have been added. See #143

Thanks, @vlad-karpukhin. I'll close this issue.