facebookresearch / KILT

Library for Knowledge Intensive Language Tasks
MIT License

Fail to reproduce 22,220,793 passages for DPR #52

Closed jzhoubu closed 2 years ago

jzhoubu commented 2 years ago

Hi, according to the paper, there should be 22,220,793 passages in the KILT knowledge source. However, my reproduction results in 24,853,658 passages.

I counted the passages using the code below:

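# kilt_wiki is the already-loaded KILT Wikipedia dump with a "full" split.
passage_num = 0
for sample in kilt_wiki["full"]:
    # Drop bullet-list paragraphs, then count words over the remaining text.
    text = [x.strip() for x in sample["text"]["paragraph"] if "BULLET::::" not in x]
    word_num = len(" ".join(text).split())
    # One passage per 100 words, plus one more for any remainder.
    passage_num += word_num // 100 + int(bool(word_num % 100))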

Could you share the 22,220,793 passages or give more detail for reproduction?
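
For reference, the counting rule can be checked on toy data without loading the full dump. This is only a minimal sketch: `count_passages` and `toy_wiki` are hypothetical names made up for illustration, and the toy records merely mimic the `sample["text"]["paragraph"]` layout used in the loop. It is not the KILT authors' preprocessing.

```python
def count_passages(paragraphs, passage_len=100):
    """Count fixed-size word passages in one article (hypothetical helper)."""
    # Same filter as the loop in this comment: skip bullet-list paragraphs.
    text = [p.strip() for p in paragraphs if "BULLET::::" not in p]
    word_num = len(" ".join(text).split())
    # Integer ceiling division, equivalent to n // 100 + int(bool(n % 100)).
    return -(-word_num // passage_len)


# Toy records, not real KILT data.
toy_wiki = [
    {"text": {"paragraph": ["word " * 250, "BULLET::::- skipped item"]}},
    {"text": {"paragraph": ["word " * 100]}},
    {"text": {"paragraph": []}},
]

total = sum(count_passages(s["text"]["paragraph"]) for s in toy_wiki)
print(total)
```

Because the ceiling is taken per article, every article with a non-empty remainder contributes one short passage, so the total depends on which paragraphs are filtered out before counting.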

jzhoubu commented 2 years ago

The DPR repo provides a variant of the KILT passage collection, which can be found here.

Following the KILT paper, I have tried several ways to reproduce the wiki corpus, but the number of passages never matches the 22,220,793 reported there. For now I don't see any way to reproduce that number, so I will close this issue.

vlad-karpukhin commented 2 years ago

Hi @sysu-zjw, the passages in the KILT format, as well as the Wikipedia snapshot, are different from DPR's. The DPR repo also provides a KILT-based Wikipedia snapshot, but it is represented in DPR's 100-word passage format, which is different from KILT's.
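
To illustrate what a 100-word passage format could look like, here is a rough sketch. It assumes non-overlapping blocks of whitespace-separated words; the actual DPR preprocessing may differ (for example in tokenization or title handling), and `split_into_passages` is a hypothetical helper, not a function from the DPR or KILT code.

```python
def split_into_passages(text, passage_len=100):
    """Split text into consecutive blocks of at most passage_len words (assumed scheme)."""
    words = text.split()
    return [
        " ".join(words[i:i + passage_len])
        for i in range(0, len(words), passage_len)
    ]


# Synthetic 230-word article, not real Wikipedia text.
article = " ".join(f"w{i}" for i in range(230))
passages = split_into_passages(article)
print([len(p.split()) for p in passages])
```

Under this assumption the last passage of an article is usually shorter than 100 words, which is one reason passage counts are sensitive to the exact splitting and filtering rules.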