facebookresearch / DPR

Dense Passage Retriever is a set of tools and models for the open-domain Q&A task.

Ask for Preprocessing Code #111

Closed by yeliu918 3 years ago

yeliu918 commented 3 years ago

Hi there,

Great work and nice project! Thank you so much for making it public and everything works well.

I'd like to ask whether it would be possible to publish the preprocessing code. In particular:

  1. the code that goes from the output of WikiExtractor to wikipedia_split/psgs_w100.tsv.
  2. the code for building the QA datasets, i.e. how to get the "positive_ctxs", "negative_ctxs" and "hard_negative_ctxs".

Best, Ye

vlad-karpukhin commented 3 years ago

Hi @yeliu918, we don't plan to share the preprocessing code - it was written by multiple people, is pretty messy, includes Java code for BM25 index processing, etc. We don't want to bring a Java dependency into this repo or publish code whose quality is not up to the mark, and we don't have time to rework it into a shareable state. Sorry.

yeliu918 commented 3 years ago

I understand. Thank you for letting me know.

In that case, could you provide the processed data for the WebQuestions, CuratedTREC, and SQuAD v1.1 datasets? I couldn't find them among the downloadable processed files. Thanks!

vlad-karpukhin commented 3 years ago

SQuAD is provided: use the https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-squad1-{train|dev}.json.gz links. The other datasets are only available in a different format and will require some additional data formatting to be compatible with the current codebase. Let me know if you are willing to do that.
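
The `{train|dev}` part of the link above is shorthand for two separate files. A minimal Python sketch of expanding it is shown here; the names `URL_TEMPLATE` and `squad_urls` are illustrative only and are not part of the DPR codebase, and the sketch makes no claim that the files are still hosted at these addresses.

```python
from typing import Dict

# Template taken from the link in the comment above; "{split}" stands in for
# the "{train|dev}" shorthand. The variable name is an assumption for illustration.
URL_TEMPLATE = (
    "https://dl.fbaipublicfiles.com/dpr/data/retriever/"
    "biencoder-squad1-{split}.json.gz"
)


def squad_urls() -> Dict[str, str]:
    """Return one concrete URL per split named in the shorthand."""
    return {split: URL_TEMPLATE.format(split=split) for split in ("train", "dev")}


if __name__ == "__main__":
    for split, url in squad_urls().items():
        print(split, url)
```

Each resulting URL can then be fetched with any HTTP client and decompressed with `gzip`.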

yeliu918 commented 3 years ago

Thanks for sharing the link.

May I ask how you processed the Wikipedia dump? I get 75M passages rather than 21M from the WikiExtractor output, so I guess you applied some filtering to the passages.

vlad-karpukhin commented 3 years ago

Yes, we used some filtering scripts taken from DrQA (https://github.com/facebookresearch/DrQA), plus some custom processing, and then split the text into 100-word passages. We did that a while ago, and it is pretty hard to restore the exact steps we took.
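
For readers who want to approximate the last step described above, here is a minimal Python sketch of splitting already-cleaned article text into 100-word passages. It is not the original preprocessing code: the function name `split_into_passages`, the use of whitespace tokenization, and the choice to keep a shorter final passage are all assumptions, and the DrQA-based filtering and custom processing are not reproduced.

```python
from typing import List


def split_into_passages(text: str, words_per_passage: int = 100) -> List[str]:
    """Split text into consecutive, non-overlapping passages.

    Words are whitespace-separated tokens (an assumption; the original
    tokenization is not documented in this thread). The final passage may
    contain fewer than `words_per_passage` words.
    """
    if words_per_passage <= 0:
        raise ValueError("words_per_passage must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + words_per_passage])
        for start in range(0, len(words), words_per_passage)
    ]


if __name__ == "__main__":
    # Synthetic 250-word "article" used only to show the passage sizes.
    article = " ".join(f"w{i}" for i in range(250))
    for number, passage in enumerate(split_into_passages(article), start=1):
        print(number, len(passage.split()))
```

Because the filtering is unspecified, passage counts produced this way should not be expected to match the 21M figure mentioned in the thread.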