Closed: yeliu918 closed this issue 3 years ago
Hi @yeliu918 , we don't plan to share the preprocessing code. It was done by multiple people, is pretty messy, and includes Java code for BM25 index processing, etc. We don't want to bring a Java dependency into this repo or provide code whose quality is not up to the mark, and we don't have time to modify it to make it sharable. Sorry.
I understand. Thank you for letting me know.
In this case, could you provide the processed data for the WebQuestions, CuratedTREC, and SQuAD v1.1 datasets? I didn't find them among the downloadable processed files. Thanks!
SQuAD is provided: use the https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-squad1-{train|dev}.json.gz links. The other datasets exist only in a different format and will require some additional data formatting to be compatible with the current codebase. Let me know if you'd like to do that.
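The `{train|dev}` placeholder in the link above expands to two separate files. A minimal sketch of expanding it into concrete URLs (the helper name `squad_urls` is mine, not part of the DPR codebase):

```python
# Expand the {train|dev} placeholder from the link above into concrete URLs.
BASE = "https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-squad1-{split}.json.gz"

def squad_urls(splits=("train", "dev")):
    """Return one download URL per dataset split."""
    return [BASE.format(split=s) for s in splits]

for url in squad_urls():
    print(url)
```

Each file can then be fetched with any HTTP client and decompressed with gzip before loading as JSON.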
Thanks for sharing the link.
May I ask what processing you did on the Wikipedia dump? I get 75M passages rather than 21M from the Wikiextractor output, so I guess you applied some filters to the passages.
Yes, we used some filtering scripts taken from DrQA (https://github.com/facebookresearch/DrQA), applied custom processing, and then split the text into 100-word passages. We did that a while ago, and it is pretty hard to restore the exact steps we took.
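The DrQA filtering itself isn't reconstructed here, but the final 100-word split step can be sketched roughly as follows (a minimal sketch under my own assumptions; the function name and greedy chunking strategy are illustrative, not the exact DPR code):

```python
def split_into_passages(text, passage_len=100):
    """Greedily chunk whitespace-tokenized text into passages of
    at most `passage_len` words (100, as mentioned above)."""
    words = text.split()
    return [
        " ".join(words[i:i + passage_len])
        for i in range(0, len(words), passage_len)
    ]
```

Applied to every cleaned article, a split like this is what turns the filtered dump into the ~21M fixed-length passages used for retrieval.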
Hi there,
Great work and nice project! Thank you so much for making it public and everything works well.
I'm wondering whether it would be possible to publish the preprocessing code. Especially,
Best, Ye