Metrics of Table 1

okhat commented 3 years ago

I'm trying to precisely interpret Table 1, looking at the arXiv version of the paper.

For HotpotQA, you report an R@20 of 80.2%. This is defined as:

On HotpotQA the metric is recall at the top k paragraphs. Footnote: As the sequence length is 2 for HotpotQA, we pick the top k/2 sequences predicted by MDR.

1. In this case, you take the top-10 passage pairs of MDR, create a single set of passages (up to 20 passages), and check if both gold passages are contained in this set. Is that correct? (In other words, it's not necessary that one of the top-10 pairs is the correct pair.)
2. As these sequences tend to contain duplicated passages, the total number of passages is usually fewer than 20. Is that also correct?
3. The re-ranking results in Table 2 use the best ELECTRA-large re-ranker, and not a BERT-base re-ranker. Is that right? If so, how many sequences are re-ranked? Is it 50, 100, or 250?

Lorraine333 commented 3 years ago

To answer your questions:
1. Yes, that is correct.
2. There are very few duplicated passages according to our analysis.
3. Yes, it uses the ELECTRA-large re-ranker with the top 100 sequences. (We have corrected the misleading context in the camera-ready version of the paper; sorry about the confusion.)
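
For readers who want to reproduce the metric as described in point 1, here is a minimal sketch. It is not taken from the MDR codebase: the function names (`pooled_passages`, `hit_at_k`, `recall_at_k`) and the data layout (a ranked list of passage-ID pairs plus a set of gold passage IDs per question) are assumptions made for illustration only.

```python
from typing import Hashable, Iterable, Sequence, Set, Tuple

Pair = Tuple[Hashable, Hashable]


def pooled_passages(ranked_pairs: Sequence[Pair], k: int) -> Set[Hashable]:
    """Union of the passages in the top k // 2 pairs (at most k passages)."""
    pool: Set[Hashable] = set()
    for first, second in ranked_pairs[: k // 2]:
        pool.add(first)
        pool.add(second)
    return pool


def hit_at_k(ranked_pairs: Sequence[Pair], gold: Iterable[Hashable], k: int) -> bool:
    """True if every gold passage appears somewhere in the pooled set.

    The gold passages do not have to occur together in a single pair.
    """
    return set(gold) <= pooled_passages(ranked_pairs, k)


def recall_at_k(
    examples: Sequence[Tuple[Sequence[Pair], Iterable[Hashable]]], k: int
) -> float:
    """Fraction of questions whose gold passages are all in the pooled set."""
    if not examples:
        raise ValueError("examples must not be empty")
    hits = [hit_at_k(pairs, gold, k) for pairs, gold in examples]
    return sum(hits) / len(hits)


# Hypothetical toy data: the gold pair ("a", "b") is never a top-2 pair,
# but both passages are covered by the union of the top-2 pairs.
ranked = [("a", "x"), ("b", "y"), ("a", "b")]
gold = {"a", "b"}
print(hit_at_k(ranked, gold, k=4))  # pools the top 2 pairs
print(hit_at_k(ranked, gold, k=2))  # pools only the top pair
```

Because the pool is a set, duplicated passages across pairs shrink it below k, which is the situation point 2 asks about.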

okhat commented 3 years ago

Thank you Lorraine! Closing; will re-open if I have other related questions.

facebookresearch / multihop_dense_retrieval

Metrics of Table 1 #5