castorini / pygaggle

a gaggle of deep neural architectures for text ranking and question answering, designed for Pyserini
http://pygaggle.ai/
Apache License 2.0

Duplicated answer in the QA dataset #22

Closed: gabrer closed this issue 4 years ago

gabrer commented 4 years ago

First, great work with the QA dataset and thanks for sharing!

I found that the following answer is (wrongly) duplicated in the dataset:

"id":"o56j4qio", "title":"Journal Pre-proof Prevalence of comorbidities in the novel Wuhan coronavirus (COVID-19) infection: a systematic review and meta-analysis Prevalence of comorbidities in the Novel Wuhan Coronavirus (COVID-19) infection: a systematic review and meta-analysis", "exact_answer":"OR 2.07, 95% CI: 0.89-4.82"

Also, it might be helpful to specify that the ID refers to the context article rather than uniquely identifying the answer.
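
In case it is useful for checking the rest of the dataset, here is a minimal sketch of how such duplicates could be detected. It assumes the answer records have already been flattened into a plain list of dicts carrying the `id` and `exact_answer` fields shown above; the helper name `find_duplicate_answers` and the sample records are hypothetical, and the sketch does not read the actual dataset file.

```python
from collections import Counter


def find_duplicate_answers(answers):
    """Return the (id, exact_answer) pairs that occur more than once.

    `answers` is assumed to be a flat list of dicts with the keys
    "id" and "exact_answer" (an assumption about how the records
    were loaded, not the dataset's real layout).
    """
    counts = Counter((a["id"], a["exact_answer"]) for a in answers)
    # Counter keeps insertion order, so results follow first occurrence.
    return [key for key, n in counts.items() if n > 1]


# Hypothetical sample: the first two records repeat the answer reported
# in this issue; the third is an invented answer from the same article.
sample = [
    {"id": "o56j4qio", "exact_answer": "OR 2.07, 95% CI: 0.89-4.82"},
    {"id": "o56j4qio", "exact_answer": "OR 2.07, 95% CI: 0.89-4.82"},
    {"id": "o56j4qio", "exact_answer": "a different span from the same article"},
]

duplicates = find_duplicate_answers(sample)
print(duplicates)
```

Keying on the pair rather than on `id` alone matters here: since the ID identifies the context article, several distinct answers may legitimately share it, and only a repeated (ID, answer) pair signals a true duplicate.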

Hope this helps when reviewing the dataset for future versions.

lintool commented 4 years ago

Thanks for catching this!

daemon commented 4 years ago

Closed in #23