facebookresearch / DPR

Dense Passage Retriever is a set of tools and models for the open-domain Q&A task.

NQ Dataset #221

Open varshakishore opened 2 years ago

varshakishore commented 2 years ago

I know that this code base is no longer supported, but I have a couple of questions about the NQ dataset.

The official dataset page says "Natural Questions contains 307K training examples, 8K examples for development, and a further 8K examples for testing". However, the DPR paper reports that the NQ dataset is much smaller (unfiltered training set is 79,168, filtered training set is 58,880, dev set is 8,757 and test set is 3,610). Why is this the case? Are you using an older version of the NQ dataset?

Also, I downloaded the datasets using your scripts. The training set does indeed have 58,880 samples, but the dev set only has 6,515 samples. Why are some of the samples missing from the dev set?

vladk232 commented 2 years ago

Hi @varshakishore, the open-domain version of NQ is a subset of the "main" NQ dataset, and there is no direct correspondence between their dev/test splits (the official NQ dev set serves as the test set for open-domain NQ). You can find information about the differences on Google's NQ GitHub page. As described in the paper, we used the same filtering process as in the ORQA paper. Google hadn't released the OD version of NQ at the time, so we just repeated the same steps. They have since released all the splits, so you can just reference/download OD NQ from the official site.
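
For anyone who wants to double-check split sizes like the ones quoted in this thread, here is a minimal sketch. It assumes the downloaded split is a single JSON file whose top level is an array of examples; that layout, the helper name `count_examples`, and the file name in the usage comment are all assumptions for illustration, not something confirmed in this thread.

```python
import json


def count_examples(path):
    """Return the number of examples in a JSON file holding a top-level array.

    Assumption: the file is one JSON array, one element per example.
    """
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError(f"expected a top-level JSON array in {path}")
    return len(data)


# Hypothetical usage (the file name is a placeholder, not a confirmed path):
# print(count_examples("nq-dev.json"))
```

If the count differs from the number reported in the paper, that would be consistent with the file having gone through additional filtering, but the sketch itself only reports the size of whatever file it is given.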