Query on Table 4 - Githubissues

robinsongh381 commented 3 years ago

Hello !

I could not find any explicit mention from the paper that whether the DPR result for the Table 4 is from dev or test set.

I suspect, reader models were examined on test datasets.

If so, do they correspond to a json output which is generated upon running dense_retriever.py against data/data/retriever/qas/{nq, trivia, squad1}-test.csv ?

Thank you

vlad-karpukhin commented 3 years ago

Hi, Results in the paper are from test sets (official dev for NQ). Our dev split for NQ is our own and doesn't correspond to that mentioned in google papers like ORQA or REALM.

robinsongh381 commented 3 years ago

Alright, thanks.

I have further questions regarding your answe.

When you make the Table 4, did you then train with "official" NQ/Trivia/SQuAD/WQ/TREC train datasets rather than provided datasets from download_data.py ? (Eg. data.retriever.qas.nq-train or data.retriever.qas.trivia-train)
If so, how did you mange to use gold_passages_info ? Or you did not use it for "official" train datasets ?

After some search, I found this link provides "official" train & dev sets for NQ dataset.

However, as expected, the dev set (as well as train set) does not include ctxs information whereas the outcome of dense_retriever.py (which is used as dev set in case of your repo) does include ctxs.
Also, since the "official" dev set and your designed dev set are different, this it mean that your provided gold_passages_src would no longer valid for "official" dev set ?

What I'm currently doing is to propose a new model for Retriever (a modification of Bi Encoder). I trained this model with retrieval tasks against NQ/Trivia QA/SQuAD datasets and apparently obtained better top-k accuracy than reproduced DPR's result.

After this, I wanted to perform QA experiments with the 3 datasets but I'm not sure how to make a "fair" comparison. If I use the outcome of dense_retriever.py, the result would not be fair since test set would be different from yours (and others as well). However, If I use the "official" test set (as linked above) its format seems to be incompatible with current preprocess. (due to the reason 1 above)

Please give me your advice.

Thanks

vlad-karpukhin commented 3 years ago

Hi @robinsongh381 , Sorry for a late reply, just noted your follow-up questions.

All QAS (questions & answers) data files are only for inference/evaluation, they don't have any passages, just questions & answers. So one can't use them for any training. All our train sets used for training are subsets (sometimes full) of the official train sets. All our 'qas' train sets are the official train splits but without passages. For some datasets like NQ, we lost part of the official train samples because of gold passages mapping.
I don't understand this question. Can you elaborate?
As of NQ open - it wasn't officially ready for downloading when we were doing DPR. We just followed exact same process described in ORQA paper to extract this dataset from full NQ data and got same total amount of (train+dev) & test splits like in the google's paper, just our dev split is different (i.e. it has same amount of samples but they are different in general). dense_retriever.py doesn't know anything about official ctx, it just returns best candidates from our wikipedia index, which has nothing to do with 'official' NQ ctxs (or as they call them candidate spans).
The usage of gold_passages_src is NQ-only heuristic and is used just to en-force injecting gold ctx+ into the reader training process and select other ctx+ and ctx- based on the link to the gold wikipedia page. "your provided gold_passages_src would no longer valid for "official" dev set" - yes but this task's NQ dev set is only used for checkpoint selection and we don't use its results for any comparison with other papers.

Please note that gold_passages_src used for reader data pre-processing is not the same step when I mention "gold ctx+ mapping to our wikipedia split"

To summarize NQ splits processing pipeline:

Full official NQ train set -> filter open domain task samples -> split to train & dev splits -> these are 'qas' files.
Do gold ctx+ mapping for train/dev sets from 1 above to generate bi-encoder training data. this step looses ~25% of samples. These are train&dev DPR json files.
Full official NQ dev set -> filter open domain task samples -> this is test split 'qas' file.

facebookresearch / DPR

Query on Table 4 #126