facebookresearch / DPR

Dense Passage Retriever (DPR) is a set of tools and models for the open-domain Q&A task.

Question about Wikipedia Corpus Preprocessing #222

Open manveertamber opened 2 years ago

manveertamber commented 2 years ago

Hi,

In this thread: https://github.com/facebookresearch/DPR/issues/42, it was mentioned that the pages were split into 100-word passages using the spaCy en-web tokenizer. I tried to reproduce this myself, counting a token as a word if its is_alpha attribute was true, but my passages were on average slightly longer than the Wikipedia DPR 100-word passages. Could you elaborate on how the tokenizer was used to count 100 words, please?
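
For reference, here is a minimal sketch of the splitting approach I tried. Since only tokenization is needed, it uses spacy.blank("en"); the thread above only says "spaCy en-web tokenizer", so the exact pipeline, and which tokens count toward the 100-word budget, are assumptions (and exactly what I'm asking about):

```python
import spacy

# Only tokenization is needed here; spacy.blank("en") loads just the
# English tokenizer without any trained pipeline components. Whether DPR
# used a blank pipeline or a full en_core_web_* model is an assumption.
nlp = spacy.blank("en")

def split_into_passages(text: str, words_per_passage: int = 100) -> list[str]:
    """Split text into passages, counting only alphabetic tokens
    (token.is_alpha) toward the word budget."""
    doc = nlp(text)
    passages, buffer, word_count = [], [], 0
    for token in doc:
        buffer.append(token.text_with_ws)  # keep original whitespace
        if token.is_alpha:  # count only alphabetic tokens as "words"
            word_count += 1
        if word_count == words_per_passage:
            passages.append("".join(buffer).strip())
            buffer, word_count = [], 0
    if buffer:  # trailing partial passage
        passages.append("".join(buffer).strip())
    return passages
```

Note that with this counting rule, punctuation and numeric tokens don't consume the 100-word budget, so each passage can contain well over 100 raw tokens; I suspect the discrepancy comes from a difference in how "word" was defined.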