Train-test split of the TREC dataset

brmson / dataset-factoid-curated

A curated question answering research dataset of factoid questions

Train-test split of the TREC dataset #1

Open Victor0118 opened 5 years ago

Victor0118 commented 5 years ago

In some open-domain QA papers, I saw that the CuratedTREC dataset is used and linked to this repository, but I cannot find the train/test split here. Even more surprisingly, the train/test split sizes reported in two papers differ:

https://arxiv.org/pdf/1709.00023.pdf: 1204/694
https://arxiv.org/pdf/1704.00051.pdf: 1486/694

Does anyone know how to solve this problem?

jhyuklee commented 4 years ago

I guess the split is based on these two files: large2470-test.tsv and large2470-train.tsv (Large Variant of the Dataset), excluding QA pairs with 'lfb' ids (QA pairs from live.ailao.eu, I guess; see https://github.com/brmson/dataset-factoid-curated/commit/d81aca55d9afdc9b541ce403b4d346e63375db6b).

The numbers from the DrQA paper (1486/694) are correct in this case, but I'm not sure where the number 1204 in the R^3 paper comes from.
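
The filtering described above can be checked locally with a short script. This is only a minimal sketch: it assumes each TSV row carries the question id in its first tab-separated column and that the 'lfb' ids contain the substring `lfb`. Neither assumption is verified against the repository, and the function name `count_non_lfb` and the sample rows are made up for illustration.

```python
import csv
import io


def count_non_lfb(lines, id_column=0, marker="lfb"):
    """Count TSV rows whose id field does not contain `marker`.

    `lines` is any iterable of text lines (an open file, a StringIO, ...).
    Assumption: the id lives in column `id_column` of each row.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    kept = 0
    for row in reader:
        if not row:
            # Skip blank lines.
            continue
        if marker in row[id_column]:
            # Skip QA pairs with an 'lfb' id.
            continue
        kept += 1
    return kept


# Made-up sample rows, only to show the call; not taken from the dataset.
sample = (
    "q0001\tWho wrote Hamlet?\tShakespeare\n"
    "lfb0002\tWhat is the capital of France?\tParis\n"
    "q0003\tHow tall is Mount Everest?\t8,?848\n"
)
print(count_non_lfb(io.StringIO(sample)))
```

To run it on the real files, open each one with `open("large2470-train.tsv", newline="", encoding="utf-8")` and pass the file object to `count_non_lfb`. If the assumptions hold, the two counts should match the train/test sizes reported in the DrQA paper. If they do not, the id column or the marker probably needs adjusting.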