beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
http://beir.ai
Apache License 2.0

Training data for NQ? #36

Open ReyonRen opened 3 years ago

ReyonRen commented 3 years ago

Thanks for the great contribution!

I found that the downloaded NQ data only contains the test files and the corpus. Where can I get the training files?

Thank you!

thakur-nandan commented 3 years ago

Hi @ReyonRen,

You can use this dataset for NQ training: https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/NQ-train_pairs.jsonl.gz

Kind Regards, Nandan
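For readers who want to inspect the linked file, here is a minimal sketch of reading a gzipped JSONL file of training pairs. It assumes each non-empty line is a JSON array of two strings, `[query, passage]`; that layout is an assumption and should be checked against the first few lines of the download. The function name `read_pairs` is hypothetical.

```python
import gzip
import json
from typing import Iterator, Tuple


def read_pairs(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (query, passage) pairs from a gzipped JSONL file.

    Assumption: every non-empty line is a JSON array of exactly two strings.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            query, passage = json.loads(line)
            yield query, passage
```

Usage would look like `for query, passage in read_pairs("NQ-train_pairs.jsonl.gz"): ...` after downloading the file locally. Reading lazily avoids holding the whole file in memory.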

ReyonRen commented 3 years ago

Thank you very much. Do the queries in this training set map to a subset of the passages in your open-source NQ corpus?

thakur-nandan commented 3 years ago

Hi @ReyonRen,

Actually, the evaluation corpus is a subset of the training set. This is because the original NQ dataset often contains duplicated pages, i.e. identical Wikipedia pages from, let's say, 2014, 2015, etc.

While creating the BEIR NQ evaluation corpus, we only evaluate a single question for a Wikipedia passage, because adding other passages with the same title but from a different year (let's say 2014 or 2015) would introduce duplicates into the dataset.

However, during training, you do not care about duplicates and train with all passage and question combinations!

Kind Regards, Nandan Thakur
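The distinction described above can be illustrated with a toy sketch: the evaluation corpus keeps one passage per title, while the training side keeps everything. This is not the BEIR preprocessing code; the `(title, text)` record layout, the "keep the first one seen" rule, and the name `dedup_by_title` are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Passage = Tuple[str, str]  # (title, text); hypothetical record layout


def dedup_by_title(passages: List[Passage]) -> List[Passage]:
    """Keep only the first passage seen for each title, preserving order."""
    first_by_title: Dict[str, Passage] = {}
    for title, text in passages:
        # setdefault leaves an existing entry untouched, so later
        # passages with the same title are dropped.
        first_by_title.setdefault(title, (title, text))
    return list(first_by_title.values())


all_passages = [
    ("Hamlet", "Hamlet is a tragedy (2014 snapshot)."),
    ("Hamlet", "Hamlet is a tragedy (2015 snapshot)."),
    ("Paris", "Paris is the capital of France."),
]
training_passages = all_passages                   # training keeps duplicates
evaluation_passages = dedup_by_title(all_passages)  # evaluation keeps one per title
```

Under these assumptions, every evaluation passage also appears in the training passages, which matches the subset relationship described in the reply.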

ReyonRen commented 3 years ago

Thank you for the kind reply!

jaxball commented 2 years ago

Hi @NThakur20, is it possible to make the preprocessing code from jsonl to TSV available for the NQ dataset? Or if the train.tsv for NQ is available for download, that'd be helpful too.

mrdrozdov commented 1 year ago

The tsv format is here: https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nq-train.zip

Discussed here: https://github.com/beir-cellar/beir/issues/108
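For anyone who still wants to convert (query, passage) pairs into a BEIR-style folder themselves, here is a rough sketch using only the standard library. It is not the maintainers' preprocessing script. It assumes the usual BEIR layout of `corpus.jsonl` (`_id`, `title`, `text`), `queries.jsonl` (`_id`, `text`) and `qrels/<split>.tsv` with a `query-id`, `corpus-id`, `score` header. The function name `write_beir_style`, the generated ids (`q0`, `p0`, ...), and the empty titles are hypothetical choices.

```python
import csv
import json
import os
from typing import Dict, Iterable, Tuple


def write_beir_style(pairs: Iterable[Tuple[str, str]], out_dir: str, split: str = "train") -> None:
    """Write (query, passage) pairs as corpus.jsonl, queries.jsonl and qrels/<split>.tsv."""
    os.makedirs(os.path.join(out_dir, "qrels"), exist_ok=True)
    query_ids: Dict[str, str] = {}
    passage_ids: Dict[str, str] = {}
    qrels: Dict[Tuple[str, str], int] = {}  # dict used as an ordered set

    for query, passage in pairs:
        # Reuse the id if the same text was seen before, else assign a new one.
        qid = query_ids.setdefault(query, f"q{len(query_ids)}")
        pid = passage_ids.setdefault(passage, f"p{len(passage_ids)}")
        qrels[(qid, pid)] = 1  # every listed pair is treated as relevant

    with open(os.path.join(out_dir, "queries.jsonl"), "w", encoding="utf-8") as f:
        for text, qid in query_ids.items():
            f.write(json.dumps({"_id": qid, "text": text}) + "\n")

    with open(os.path.join(out_dir, "corpus.jsonl"), "w", encoding="utf-8") as f:
        for text, pid in passage_ids.items():
            # The pairs carry no title, so it is left empty here.
            f.write(json.dumps({"_id": pid, "title": "", "text": text}) + "\n")

    qrels_path = os.path.join(out_dir, "qrels", f"{split}.tsv")
    with open(qrels_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["query-id", "corpus-id", "score"])
        for (qid, pid), score in qrels.items():
            writer.writerow([qid, pid, score])
```

Note that ids generated this way will not match the ids in the official BEIR NQ corpus, so the downloadable zip linked above remains the safer choice when consistency with the benchmark matters.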