Closed seanmacavaney closed 3 years ago
So here's a proposal:
Step 1: Download train and dev files one by one from https://storage.googleapis.com/natural_questions/v1.0/train/nq-train-00.jsonl.gz (note: storage.cloud.google.com requires authentication, storage.googleapis.com does not). There are 50 train files and 5 dev files, totaling 42GB. But we do not need to keep them around after this processing. Note that the test set is not available.
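To make the download step concrete, here's a minimal sketch of building the shard URLs. It assumes the dev files follow the same `nq-dev-NN.jsonl.gz` naming pattern as the train file linked above; only the first train URL is confirmed from the source.

```python
# Base URL for unauthenticated access (storage.googleapis.com, not storage.cloud.google.com)
BASE = "https://storage.googleapis.com/natural_questions/v1.0"

def nq_file_urls():
    """Build URLs for the 50 train and 5 dev shards.

    The dev naming pattern (nq-dev-NN.jsonl.gz) is assumed by analogy
    with the train files; only nq-train-00.jsonl.gz is confirmed.
    """
    urls = [f"{BASE}/train/nq-train-{i:02d}.jsonl.gz" for i in range(50)]
    urls += [f"{BASE}/dev/nq-dev-{i:02d}.jsonl.gz" for i in range(5)]
    return urls
```

Each file could then be streamed through `gzip.open` and deleted after processing, so the 42GB never needs to be kept around.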
Step 2: Build up documents based on the passages provided in each record's `long_answer_candidates`. Note that these can overlap, but `top_level` on each candidate indicates whether it's a top-most candidate passage.
docs will look like:

```
NqPassage:
  doc_id: str  # a sequentially-assigned document ID (unique based on URL) + the index of the passage
  text: str  # tokenized text of the passage, with all HTML tokens removed
  html: str  # raw HTML of the passage
  start_byte: int  # this and the following three are from the `long_answer_candidates` objects and may be useful for something
  end_byte: int
  start_token: int
  end_token: int
  document_title: str  # from the document itself
  document_url: str  # from the document itself
  parent_doc_id: str  # doc_id of the largest passage it's under (e.g., a sentence under a paragraph), or None if it's a top-level passage
```
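A rough sketch of Step 2 for one record (the `text`/`html` extraction from token and byte spans is omitted; field names follow the NQ JSONL schema, and the parent lookup here is just a widest-enclosing-span containment check, not a definitive implementation):

```python
def candidates_to_passages(record, seq_doc_id):
    """Turn one record's long_answer_candidates into passage dicts.

    seq_doc_id is the sequentially-assigned per-document ID described
    above; each passage gets that ID plus its candidate index.
    """
    cands = record["long_answer_candidates"]
    passages = []
    for idx, cand in enumerate(cands):
        parent = None
        if not cand.get("top_level", True):
            # Parent = the largest other candidate whose byte span encloses this one
            enclosing = [(j, c) for j, c in enumerate(cands)
                         if j != idx
                         and c["start_byte"] <= cand["start_byte"]
                         and c["end_byte"] >= cand["end_byte"]]
            if enclosing:
                j, _ = max(enclosing,
                           key=lambda jc: jc[1]["end_byte"] - jc[1]["start_byte"])
                parent = f"{seq_doc_id}-{j}"
        passages.append({
            "doc_id": f"{seq_doc_id}-{idx}",
            "start_byte": cand["start_byte"],
            "end_byte": cand["end_byte"],
            "start_token": cand["start_token"],
            "end_token": cand["end_token"],
            "document_title": record["document_title"],
            "document_url": record["document_url"],
            "parent_doc_id": parent,
        })
    return passages
```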
Step 3: Queries come directly from `question_text`, as a `GenericQuery`.
Step 4: Qrels are assigned based on the annotation's `long_answer`. There can be multiple, but there's usually just one. Sometimes there's none (indicated by `long_answer.candidate_index == -1`). Short answers (answer spans) and `yes_no_answer` are also included in the qrels object. From my tests, there never appears to be a short answer when there's no long answer.
qrels will look like:

```
NqQrel:
  query_id: str
  doc_id: str
  relevance: int  # always 1
  short_answers: List[str]  # the *string* representations of the answers (this is similar to how DPH evaluates)
  yes_no_answer: str
```
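Step 4 could be sketched roughly as follows (annotation field names follow the NQ JSONL schema; extracting the short-answer strings from their token spans is omitted here, so `short_answers` is left empty in this sketch):

```python
def annotations_to_qrels(record, query_id, seq_doc_id):
    """Derive qrel dicts from one record's annotations.

    Annotations with candidate_index == -1 have no long answer and
    produce no qrel. seq_doc_id matches the per-document ID used for
    the passages, so doc_id points at the relevant passage.
    """
    qrels = []
    for ann in record.get("annotations", []):
        la = ann["long_answer"]
        if la["candidate_index"] == -1:
            continue  # no long answer for this annotation
        qrels.append({
            "query_id": query_id,
            "doc_id": f"{seq_doc_id}-{la['candidate_index']}",
            "relevance": 1,
            "short_answers": [],  # string extraction from token spans omitted
            "yes_no_answer": ann.get("yes_no_answer", "NONE"),
        })
    return qrels
```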
Step 5: Scoreddocs are assigned based on the list of all passages for the corresponding document.
As requested by an anonymous SIGIR reviewer.
Dataset Information:
Google NQ is a question answering dataset sourced from Google's query log, with answers from Wikipedia articles.
Links to Resources:
Dataset ID(s):
- `nq` (docs)
- `nq/train` (docs, queries, qrels)
- `nq/dev` (docs, queries, qrels)

Supported Entities
Additional comments/concerns/ideas/etc.
They suggest using `gsutil` to download the data (thanks @andrewyates), but I'd rather not add that dependency (and all its transitive dependencies). It looks like we may be able to build up URLs like: https://storage.cloud.google.com/natural_questions/