facebookresearch / DPR

Dense Passage Retriever is a set of tools and models for the open-domain Q&A task.

Fine tune DPR #20

Closed santhoshkolloju closed 4 years ago

santhoshkolloju commented 4 years ago

I have around 5,000 question and passage pairs and would like to fine-tune DPR on my data. I didn't find any script to convert QA pairs into the format which DPR uses for training (with positive and negative contexts).

Can someone please help me with this?

vlad-karpukhin commented 4 years ago

Hi Santhosh,

The preprocessing we did to mine hard negative passages uses Lucene for BM25 retrieval. We deliberately excluded the Lucene (and thus Java) dependency from the project, since it would make installation and setup much more complex. Also, mining hard negatives can be very task/project specific, and even if you use BM25 you may have your own database of passages, which you would have to index from scratch anyway. That said, you unfortunately need to build your own pipeline to mine hard negatives and convert them to the DPR JSON format.
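For anyone looking for a starting point, that pipeline (retrieve candidates for each question, keep high-scoring passages that don't contain the answer, write out DPR's JSON training format) can be sketched roughly as follows. This is a minimal illustration, not DPR code: the word-overlap scorer is a toy stand-in for a real BM25/Lucene index, and `build_dpr_record` is a hypothetical helper name.

```python
import json


def overlap_score(query, passage):
    # Toy lexical-overlap score, standing in for BM25 (hypothetical helper).
    return len(set(query.lower().split()) & set(passage.lower().split()))


def build_dpr_record(question, answer, positive, passage_pool, n_hard=1):
    # Rank the pool against the question; the best-scoring passages that
    # do NOT contain the answer string become hard negatives.
    ranked = sorted(passage_pool,
                    key=lambda p: overlap_score(question, p["text"]),
                    reverse=True)
    hard_negs = [p for p in ranked
                 if answer.lower() not in p["text"].lower()][:n_hard]
    return {
        "question": question,
        "answers": [answer],
        "positive_ctxs": [positive],
        "negative_ctxs": [],  # may be left empty; in-batch negatives are used
        "hard_negative_ctxs": hard_negs,
    }


if __name__ == "__main__":
    pool = [
        {"title": "Paris", "text": "Paris is the capital of France."},
        {"title": "Lyon", "text": "Lyon is a large city in France."},
    ]
    rec = build_dpr_record(
        "What is the capital of France?",
        "Paris",
        {"title": "Paris", "text": "Paris is the capital of France."},
        pool,
    )
    print(json.dumps([rec], indent=2))
```

A real pipeline would replace `overlap_score` with BM25 retrieval over your own passage index, but the record layout matches the training format discussed later in this thread.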

As an option, you can use our provided retriever checkpoint and the generated Wikipedia split encodings, available (to be included into the downloader tool) at: https://dl.fbaipublicfiles.com/dpr/data/wiki_encoded/single/nq/wiki_passages_{0 to 49} # 50 links

santhoshkolloju commented 4 years ago

Will it be sufficient if I provide one hard negative from the BM25 search results? As I understand from the paper, negative samples are taken from the other passages in a particular batch.

[
  {
    "question": "....",
    "answers": ["...", "...", "..."],
    "positive_ctxs": [{"title": "...", "text": "...."}],
    "negative_ctxs": ["..."],
    "hard_negative_ctxs": ["..."]
  },
  ...
]

The negative_ctxs list can be empty, right?

vlad-karpukhin commented 4 years ago

"Will it be sufficient if I provide one hard negative from the BM25 search results?" - that depends on your task and data; you can adjust the exact number of negatives and hard negatives with the command-line options --hard_negatives and --other_negatives. They can both be 0 as well. From our experiments, we found that adding hard negatives generally helps to improve retrieval performance.
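As a concrete illustration, a training invocation using one hard negative and no other negatives per question might look like the sketch below. Only `--hard_negatives` and `--other_negatives` are confirmed in this thread; the other flag names and all file paths are placeholders from memory and may differ between DPR versions, so check `python train_dense_encoder.py --help` before copying.

```shell
# Hypothetical invocation; only --hard_negatives / --other_negatives
# are confirmed flags, everything else is a placeholder.
python train_dense_encoder.py \
  --train_file my_data/dpr_train.json \
  --dev_file my_data/dpr_dev.json \
  --hard_negatives 1 \
  --other_negatives 0 \
  --output_dir my_checkpoints
```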

santhoshkolloju commented 4 years ago

Hi @vlad-karpukhin ,

I need some suggestions on choosing hard negatives. Can I use results from Elasticsearch and choose one example which has a really low score?

Thanks

vlad-karpukhin commented 4 years ago

"Can I use results from Elasticsearch and choose one example which has a really low score?" - one should use the highest-scoring BM25 results which don't contain the answer as hard negatives.
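In other words, a good hard negative is lexically close to the question but does not actually answer it. Assuming you already have hits sorted by descending BM25/Elasticsearch score, the selection step might look like this minimal sketch (the function name and dict layout are illustrative, not an Elasticsearch or DPR API):

```python
def pick_hard_negatives(hits, answers, k=1):
    """Return the k highest-scoring hits that contain none of the answers.

    hits: list of {"title": ..., "text": ...} dicts, already sorted by
    descending retrieval score (e.g. BM25 via Elasticsearch).
    """
    negs = []
    for hit in hits:
        text = hit["text"].lower()
        # Skip passages that contain an answer string: those are
        # (possibly unlabeled) positives, not negatives.
        if any(a.lower() in text for a in answers):
            continue
        negs.append(hit)
        if len(negs) == k:
            break
    return negs
```

By contrast, the lowest-scoring results make poor hard negatives: they are trivially distinguishable from the positives, so the model learns little from them.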

vlad-karpukhin commented 4 years ago

Hi Vinicius,

Thanks for drawing our attention to this missing data. Yes, the links above are for bi-encoder trained on NQ dataset. I've just added the resource key to the downloader tool.

Regards, Vlad

On Tue, Jun 30, 2020 at 12:37 AM vinicius-cleves notifications@github.com wrote:

As an option, you can use our provided retriever checkpoint and generated wikipedia split encodings available (to be included into downloader tool) by:

https://dl.fbaipublicfiles.com/dpr/data/wiki_encoded/single/nq/wiki_passages_{0 to 49} # 50 links

Hi Vlad,

This link you provided contains the encoded vectors for psgs_w100.tsv, right? It would definitely help anyone who comes across this to see it on the README.md page; I think you should consider adding it there until you manage to add it to the download helper. I was struggling to get the processing power to encode these myself.

Congratulations on your great work, and thanks for making it all so available and portable for the community.
