allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

Downloading Natural Questions Dev also grabs Train #236

Open kyleclo opened 1 year ago

kyleclo commented 1 year ago

Describe the bug

I'm running:

import ir_datasets
# Load only the Dev split, then map each query ID to its text.
nq = ir_datasets.load('natural-questions/dev')
nq_query_id_to_query = {q.query_id: q.text for q in nq.queries_iter()}

But this downloads not only the Dev files but also the Train files:

[INFO] [starting] processing nq
[INFO] [starting] https://storage.googleapis.com/natural_questions/v1.0/dev/nq-dev-00.jsonl.gz
processing nq: 1593question [01:23, 19.13question/s, file=nq-dev-00]
[INFO] [finished] https://storage.googleapis.com/natural_questions/v1.0/dev/nq-dev-00.jsonl.gz: [01:23] [220MB] [2.63MB/s]
[INFO] [starting] https://storage.googleapis.com/natural_questions/v1.0/dev/nq-dev-01.jsonl.gz
processing nq: 3090question [02:39, 19.36question/s, file=nq-dev-01]
[INFO] [finished] https://storage.googleapis.com/natural_questions/v1.0/dev/nq-dev-01.jsonl.gz: [01:15] [200MB] [2.65MB/s]
[INFO] [starting] https://storage.googleapis.com/natural_questions/v1.0/dev/nq-dev-02.jsonl.gz
processing nq: 4650question [04:12, 18.40question/s, file=nq-dev-02]
[INFO] [finished] https://storage.googleapis.com/natural_questions/v1.0/dev/nq-dev-02.jsonl.gz: [01:32] [210MB] [2.27MB/s]
[INFO] [starting] https://storage.googleapis.com/natural_questions/v1.0/dev/nq-dev-03.jsonl.gz
processing nq: 6240question [05:45, 18.08question/s, file=nq-dev-03]
[INFO] [finished] https://storage.googleapis.com/natural_questions/v1.0/dev/nq-dev-03.jsonl.gz: [01:32] [217MB] [2.34MB/s]
[INFO] [starting] https://storage.googleapis.com/natural_questions/v1.0/dev/nq-dev-04.jsonl.gz
processing nq: 7821question [07:14, 18.01question/s, file=nq-dev-04]
[INFO] [finished] https://storage.googleapis.com/natural_questions/v1.0/dev/nq-dev-04.jsonl.gz: [01:28] [221MB] [2.49MB/s]
[INFO] [starting] https://storage.googleapis.com/natural_questions/v1.0/train/nq-train-00.jsonl.gz
processing nq: 13787question [12:12, 18.81question/s, file=nq-train-00]
[INFO] [finished] https://storage.googleapis.com/natural_questions/v1.0/train/nq-train-00.jsonl.gz: [04:58] [859MB] [2.88MB/s]
[INFO] [starting] https://storage.googleapis.com/natural_questions/v1.0/train/nq-train-01.jsonl.gz
processing nq: 19919question [17:16, 19.22question/s, file=nq-train-01]
...

Affected dataset(s): Natural Questions

To Reproduce: See the code above.

Expected behavior: Loading natural-questions/dev should download only the Dev files.
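
In the meantime, one possible workaround is to read the question text straight from the Dev shards named in the log, bypassing ir_datasets entirely. The following is only a minimal sketch under several assumptions: the five nq-dev-*.jsonl.gz files have already been downloaded to a local directory, each line is a JSON object with `example_id` and `question_text` fields (as in the Natural Questions release format), and the helper name `read_queries` is hypothetical. The IDs it returns are not guaranteed to match the `query_id` values ir_datasets assigns.

```python
import gzip
import json
from pathlib import Path
from typing import Dict, Iterable


def read_queries(shard_paths: Iterable[Path]) -> Dict[str, str]:
    """Map example_id -> question_text for local .jsonl.gz shards.

    Hypothetical helper; not part of ir_datasets.
    """
    queries: Dict[str, str] = {}
    for path in shard_paths:
        # Each shard is gzip-compressed JSON Lines: one example per line.
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines
                example = json.loads(line)
                # Field names are an assumption about the NQ file format.
                queries[str(example["example_id"])] = example["question_text"]
    return queries


if __name__ == "__main__":
    # Hypothetical local directory containing only the Dev shards.
    dev_shards = sorted(Path("nq/dev").glob("nq-dev-*.jsonl.gz"))
    print(len(read_queries(dev_shards)), "Dev queries loaded")
```

Because only the Dev shards are passed in, no Train file is ever opened, which is the behavior expected from `ir_datasets.load('natural-questions/dev')`.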
