BERTserini
https://github.com/castorini/bertserini
Apache License 2.0

Question about retriever.. #12

Closed: gabinguo closed this issue 3 years ago

gabinguo commented 3 years ago

A silly question about the retriever part:

I was trying to index the Wikipedia dump at the paragraph level, as you did. In the paper, you mention that you got 29.5M paragraphs, but I got 33.3M paragraphs instead. Did you apply any special filtering when splitting the articles into paragraphs, or did you simply split them with article.split("\n")?

Thanks

amyxie361 commented 3 years ago

Hi, we use "\n\n" to split the paragraphs. We use the 2016-12-21 Wikipedia dump and follow the preprocessing here.
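
For readers comparing paragraph counts, here is a minimal sketch of how the choice of separator changes the number of paragraphs. It is not the BERTserini preprocessing code: the helper `split_paragraphs` and the sample text are hypothetical, and dropping empty pieces is an assumption. The dump version and any upstream filtering can also affect the final count.

```python
from typing import List


def split_paragraphs(article: str, sep: str = "\n\n") -> List[str]:
    """Split article text on `sep`, dropping empty or whitespace-only pieces.

    Hypothetical helper for illustration; not taken from the repository.
    """
    return [piece.strip() for piece in article.split(sep) if piece.strip()]


# Made-up sample: a title, one paragraph wrapped over two lines, and a second paragraph.
article = (
    "Title\n\n"
    "First paragraph, line one.\n"
    "First paragraph, line two.\n\n"
    "Second paragraph.\n"
)

by_blank_line = split_paragraphs(article, "\n\n")  # separator from the reply above
by_newline = split_paragraphs(article, "\n")       # separator from the question

print(len(by_blank_line), len(by_newline))  # prints "3 4"
```

Splitting on a single newline breaks any paragraph that contains internal line breaks, so it yields at least as many pieces as splitting on blank lines.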

gabinguo commented 3 years ago

> Hi, we use "\n\n" to split the paragraphs. We use the 2016-12-21 Wikipedia dump and follow the preprocessing here.

Exactly what I am looking for. Thanks a lot.