allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

Google Natural Questions #57

Closed seanmacavaney closed 3 years ago

seanmacavaney commented 3 years ago

As requested by an anonymous SIGIR reviewer.

Dataset Information:

Google NQ is a question answering dataset sourced from Google's query log, with answers from Wikipedia articles.

Links to Resources:

Dataset ID(s):

Supported Entities:

Additional comments/concerns/ideas/etc.

They suggest using gsutil to download the data, but I'd rather not add that (and all its dependencies). It looks like we may be able to build up URLs like: https://storage.cloud.google.com/natural_questions/? (thanks @andrewyates)

seanmacavaney commented 3 years ago

So here's a proposal:

Step 1: Download the train and dev files one by one from URLs like https://storage.googleapis.com/natural_questions/v1.0/train/nq-train-00.jsonl.gz (note: storage.cloud.google.com requires authentication; storage.googleapis.com does not). There are 50 train files and 5 dev files, totaling 42GB, but we do not need to keep them around after processing. Note that the test set is not available.
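For what it's worth, a minimal download sketch without gsutil, assuming the shard naming pattern above holds for all shards (nq-train-00.jsonl.gz .. nq-train-49.jsonl.gz and nq-dev-00.jsonl.gz .. nq-dev-04.jsonl.gz):

```python
import os
import urllib.request

BASE = 'https://storage.googleapis.com/natural_questions/v1.0'

def download_nq(out_dir='nq_raw'):
    os.makedirs(out_dir, exist_ok=True)
    paths = [f'train/nq-train-{i:02d}.jsonl.gz' for i in range(50)]
    paths += [f'dev/nq-dev-{i:02d}.jsonl.gz' for i in range(5)]
    for path in paths:
        dest = os.path.join(out_dir, os.path.basename(path))
        # storage.googleapis.com serves these shards without authentication
        urllib.request.urlretrieve(f'{BASE}/{path}', dest)
```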

Step 2: Build up documents based on the passages provided in each record's long_answer_candidates. Note that these can overlap, but top_level on each candidate indicates whether it is a top-most candidate passage (see the sketch after the field listing below).

docs will look like:

NqPassage:
  doc_id: str # a sequentially-assigned document ID (unique based on URL) + the index of the passage
  text: str # tokenized text of the passage, with all HTML tokens removed
  html: str # raw HTML of the passage
  start_byte: int # the following are from the `long_answer_candidates` objects and may be useful for something
  end_byte: int
  start_token: int
  end_token: int
  document_title: str # from document itself
  document_url: str # from document itself
  parent_doc_id: str # doc_id of the largest passage it's under (e.g., a sentence under a paragraph), or None if it's a top-level passage
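
A rough sketch of how Step 2 might look. The NamedTuple mirrors the fields above; it assumes each record carries document_title and document_url, and extract_text / extract_html are hypothetical helpers that would slice the record's tokens / document_html by the candidate's offsets:

```python
from typing import List, NamedTuple, Optional

class NqPassage(NamedTuple):
    doc_id: str
    text: str
    html: str
    start_byte: int
    end_byte: int
    start_token: int
    end_token: int
    document_title: str
    document_url: str
    parent_doc_id: Optional[str]

def record_to_passages(record: dict, seq_doc_id: str) -> List[NqPassage]:
    candidates = record['long_answer_candidates']
    passages = []
    for idx, cand in enumerate(candidates):
        parent_doc_id = None
        if not cand['top_level']:
            # largest enclosing candidate (e.g., the paragraph a sentence sits under)
            enclosing = [(j, c) for j, c in enumerate(candidates) if j != idx
                         and c['start_token'] <= cand['start_token']
                         and c['end_token'] >= cand['end_token']]
            if enclosing:
                j, _ = max(enclosing, key=lambda jc: jc[1]['end_token'] - jc[1]['start_token'])
                parent_doc_id = f'{seq_doc_id}-{j}'
        passages.append(NqPassage(
            doc_id=f'{seq_doc_id}-{idx}',
            text=extract_text(record, cand),   # hypothetical helper
            html=extract_html(record, cand),   # hypothetical helper
            start_byte=cand['start_byte'],
            end_byte=cand['end_byte'],
            start_token=cand['start_token'],
            end_token=cand['end_token'],
            document_title=record['document_title'],
            document_url=record['document_url'],
            parent_doc_id=parent_doc_id,
        ))
    return passages
```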

Step 3: Queries come directly from question_text, exposed as a GenericQuery.
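Something like the following, assuming GenericQuery from ir_datasets.formats and that the record's example_id serves as the query_id:

```python
from ir_datasets.formats import GenericQuery

def record_to_query(record: dict) -> GenericQuery:
    # example_id and question_text are fields of each NQ jsonl record
    return GenericQuery(query_id=str(record['example_id']), text=record['question_text'])
```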

Step 4: qrels assigned based on the long_answer annotation. There can be multiple, but there's usually just one. Sometimes there's none (indicated by long_answer.candidate_index == -1). Short answers (answer spans) and yes_no_answer are also included in the qrels object. From my tests, it doesn't appear that there's ever a short answer when there's no long answer. (A sketch follows the field listing below.)

qrels will look like:

NqQrel:
  query_id: str
  doc_id: str
  relevance: int # always 1
  short_answers: List[str] # the **string** representations of the answers (this is similar to how DPR evaluates)
  yes_no_answer: str
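
A rough sketch of Step 4, assuming the annotation structure from the NQ jsonl format (annotations[*].long_answer.candidate_index, short_answers, yes_no_answer); candidate_text is a hypothetical helper that renders a short-answer span as a string:

```python
from typing import List, NamedTuple

class NqQrel(NamedTuple):
    query_id: str
    doc_id: str
    relevance: int
    short_answers: List[str]
    yes_no_answer: str

def record_to_qrels(record: dict, seq_doc_id: str) -> List[NqQrel]:
    qrels = []
    for ann in record['annotations']:
        cand_idx = ann['long_answer']['candidate_index']
        if cand_idx == -1:
            continue  # this annotation has no long answer
        qrels.append(NqQrel(
            query_id=str(record['example_id']),
            doc_id=f'{seq_doc_id}-{cand_idx}',
            relevance=1,
            short_answers=[candidate_text(record, span) for span in ann['short_answers']],  # hypothetical helper
            yes_no_answer=ann['yes_no_answer'],
        ))
    return qrels
```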

Step 5: scoreddocs assigned based on the list of all passages for the corresponding document.
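
A minimal sketch of Step 5, assuming ir_datasets' GenericScoredDoc; the score here is just a descending placeholder since the proposal doesn't prescribe one:

```python
from typing import List
from ir_datasets.formats import GenericScoredDoc

def record_to_scoreddocs(record: dict, seq_doc_id: str) -> List[GenericScoredDoc]:
    # list every passage of the query's source document
    n = len(record['long_answer_candidates'])
    return [GenericScoredDoc(query_id=str(record['example_id']),
                             doc_id=f'{seq_doc_id}-{i}',
                             score=float(n - i))
            for i in range(n)]
```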