Hi,
In this thread: https://github.com/facebookresearch/DPR/issues/42, it was mentioned that the pages were split into 100-word passages using the spaCy en-web tokenizer. I tried to reproduce this myself, counting a token as a word only if spaCy's is_alpha was True, but my passages were on average slightly longer than the Wikipedia DPR 100-word passages. Could you elaborate on how the tokenizer was used to count the 100 words, please?
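For reference, here is roughly what my reproduction attempt looks like (a minimal sketch; the model name en_core_web_sm, the disabled pipeline components, and the is_alpha counting rule are my assumptions, not something stated in the original thread):

```python
import spacy

# Assumed setup: en_core_web_sm with parser/NER disabled for speed.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def split_into_passages(text, words_per_passage=100):
    """Split text into passages, counting only alphabetic tokens as words."""
    doc = nlp(text)
    passages, current, word_count = [], [], 0
    for token in doc:
        current.append(token.text_with_ws)
        # My assumption: only tokens with is_alpha == True count toward 100.
        if token.is_alpha:
            word_count += 1
        if word_count == words_per_passage:
            passages.append("".join(current).strip())
            current, word_count = [], 0
    if current:
        passages.append("".join(current).strip())
    return passages
```

With this rule, punctuation and numbers don't count toward the 100 words, which is presumably why my passages end up longer than the released DPR ones if your counting was done differently (e.g. counting all non-whitespace tokens).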