facebookresearch / DPR

Dense Passage Retriever - a set of tools and models for the open-domain Q&A task.

Question about reader training. #69

Closed ReyonRen closed 3 years ago

ReyonRen commented 4 years ago

Hi, I'm here again :)

I tried to test the reader model (trained on your provided training data) with test data constructed from my own retrieved passages on the NQ dataset, but the results are not very good, even though the retrieval performance is quite strong.

I suspect the problem may be that your training data does not match my data, so I would like to ask how the reader's training data is structured - for example, what are the queries and where do the passages come from?

Thank you!

ReyonRen commented 4 years ago

I found that a gold-info file is used when constructing the pkl files. What is its role?

If I construct the reader training data myself, can I keep this file unchanged, or is there anything I should watch out for?

The retriever training data I used was the one released with your DPR work; in the end we used 58,812 of its queries.

ReyonRen commented 4 years ago

Or maybe I made a mistake when constructing the test pkl file?

I used the top 100 retrieved results for the 3,610 test questions and constructed a json file in the same format as your nq-test.json. I ignored the "has_answer" field and set it to 1 for all passages - I don't know whether that affects the results. Then I used your nq-test_gold_info.json as input to preprocess_reader_data.py to produce the pkl file.
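(For reference, the "has_answer" flag is normally derived from an answer-string match against the passage text rather than set to a constant. A minimal sketch of such a check - this is a simplified, whitespace-tokenized version, not DPR's actual implementation, which uses its own tokenizer and matching options:)

```python
import re
from typing import List

def _normalize(text: str) -> List[str]:
    # lowercase and split on non-alphanumeric characters
    return [t for t in re.split(r"\W+", text.lower()) if t]

def has_answer(answers: List[str], passage_text: str) -> bool:
    """Return True if any answer appears as a contiguous token span in the passage."""
    passage_tokens = _normalize(passage_text)
    for answer in answers:
        answer_tokens = _normalize(answer)
        if not answer_tokens:
            continue
        n = len(answer_tokens)
        for i in range(len(passage_tokens) - n + 1):
            if passage_tokens[i:i + n] == answer_tokens:
                return True
    return False
```

Setting the flag to 1 everywhere means the preprocessing step cannot distinguish positives from negatives, so recomputing it this way is probably safer than hard-coding it.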

Did I make a mistake somewhere?

vlad-karpukhin commented 4 years ago

Our reader training data is taken directly from the retriever results - that is the idea of our paper. But for reader training on those datasets which have gold passages provided, we have some special reader data filtering logic.

vlad-karpukhin commented 4 years ago

gold-info is the parameter for those datasets that have gold (ctx+) passages provided - the reader training data composition logic applies some special heuristics when those are available.

ReyonRen commented 4 years ago

Got it. So I need to construct a gold-info file for my training data in which each query has a positive passage (the first passage of positive_ctxs in your retriever training data)?

Thank you so much for your friendly help!

P.S. Is it OK to use the same test gold-info file as yours, since the test data also seems to contain 3,610 queries?

vlad-karpukhin commented 4 years ago

Hi, gold_passages_src & gold_passages_src_dev are not "needed". It can only improve model training slightly and is applicable for train/dev set ONLY. It should NOT be used for the test set.

ReyonRen commented 4 years ago

Hi Vladimir, I have a question about the reader training process: are the passages in your training data the top 100 retrieved for each query? And do you artificially control the ratio of positive to negative examples?

Thank you!

vlad-karpukhin commented 4 years ago

Hi, yes, there is a set of heuristics we use to convert the retriever results into reader training batches. As mentioned above, if the gold positive passages are available, we use only the retrieved positive passages that match them. If not, we use the top-k positive (answer-containing) retrieved passages. You can take a detailed look at our code starting at https://github.com/facebookresearch/DPR/blob/master/dpr/data/reader_data.py#L103
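(The selection logic described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual DPR code - the function name, the `gold_title` matching criterion, and `max_positives` are all assumptions made for the sketch:)

```python
from typing import List, Optional

def select_positives(retrieved: List[dict],
                     gold_title: Optional[str] = None,
                     max_positives: int = 1) -> List[dict]:
    """Pick positive passages for one reader training example.

    Prefer answer-containing passages that match the gold passage
    (when gold info is available); otherwise fall back to the top-k
    answer-containing retrieved passages.
    """
    # passages are assumed to arrive in retriever score order
    answer_hits = [p for p in retrieved if p.get("has_answer")]
    if gold_title is not None:
        gold_matched = [p for p in answer_hits if p.get("title") == gold_title]
        if gold_matched:
            return gold_matched[:max_positives]
    return answer_hits[:max_positives]
```

This also shows why the gold-info file only matters when it is provided: without it, the fallback branch (top-k answer-containing passages) is used.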

ReyonRen commented 4 years ago

Thank you!

I have another question again :)

In https://github.com/facebookresearch/DPR/blob/27a8436b070861e2fff481e37244009b48c29c09/dpr/data/reader_data.py#L96 I'd like to ask about the effect of include_gold_passage. I notice it defaults to False - how does performance change if I set it to True?

Thank you again!

vlad-karpukhin commented 4 years ago

Hi, we planned to evaluate the idea of forcing gold passages into the training process, but we never actually ran that experiment. Based on our experience with forcing gold positive contexts into retriever training and the heuristics used for reader training, it would probably give you a 0.5-1 exact match point gain, but hardly more.

vlad-karpukhin commented 3 years ago

Guess I can close this now.