NQ Dataset - Githubissues

facebookresearch / DPR

Dense Passage Retriever (DPR) is a set of tools and models for the open-domain Q&A task.


I know that this code base is no longer supported, but I have a couple of questions about the NQ dataset.

The official dataset page says "Natural Questions contains 307K training examples, 8K examples for development, and a further 8K examples for testing". However, the DPR paper reports that the NQ dataset is much smaller (unfiltered training set is 79,168, filtered training set is 58,880, dev set is 8,757 and test set is 3,610). Why is this the case? Are you using an older version of the NQ dataset?

Also, I downloaded the datasets using your scripts. The training set does indeed have 58,880 samples, but the dev set only has 6,515 samples. Why are some of the samples missing from the dev set?

NQ Dataset #221