airKlizz / MsMarco

Re-ranking task using MS MARCO dataset and Hugging Face library

Evaluation methodology #2

Closed pommedeterresautee closed 4 years ago

pommedeterresautee commented 4 years ago

Hi,

I am playing with the dataset you have generated (the one behind this link https://drive.google.com/open?id=1-LZcCSVwejkdMg_9rnteUi2kHBZT-NP_).

I have trained a model based on PyTorch, also leveraging Hugging Face (I chose to strictly follow Nogueira's architecture: CLS -> dropout -> 1 dense layer -> softmax -> cross entropy). I wanted to compare it to the scores you published, but looking deeper I again have a few questions :-)
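
For concreteness, here is a minimal sketch of the head I mean (illustrative only, not my exact training code; the model name and dropout value are placeholders):

```python
# Sketch: encoder [CLS] output -> dropout -> one dense layer (2 classes) -> softmax / cross entropy
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class CrossEncoderReranker(nn.Module):
    def __init__(self, model_name="bert-base-uncased", dropout=0.1):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, 2)  # not relevant / relevant

    def forward(self, input_ids, attention_mask, token_type_ids=None, labels=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask,
                           token_type_ids=token_type_ids)
        cls = out.last_hidden_state[:, 0]                  # [CLS] representation
        logits = self.classifier(self.dropout(cls))
        if labels is not None:
            return nn.functional.cross_entropy(logits, labels), logits
        return logits

# The relevance score of a (query, passage) pair is the softmax probability of the "relevant" class:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = CrossEncoderReranker()
enc = tokenizer("what is the capital of france", "Paris is the capital of France.",
                return_tensors="pt", truncation=True, max_length=512)
score = torch.softmax(model(**enc), dim=-1)[0, 1].item()
```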

My understanding is that you have generated your own eval set. This set keeps only the top 50 BM25 candidates (passages with an index above 50 are not included in the JSON passage file).

Playing with the dataset, I noticed that the dev file run.dev.small.tsv contains 1.5K examples where the gold doc from qrels.dev.small.tsv is above index 50, so it is not possible to find the right answer. A concrete example: for query_id 524332, the right doc id is 740662, which has index 859 in the dev file. That's why this doc is not included in passages.bm25.small.json.
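
Something like this is enough to count these impossible queries (a quick sketch; I am assuming the standard MS MARCO formats, run = "qid \t pid \t rank" and qrels = "qid \t 0 \t pid \t 1", so adjust if your files differ):

```python
# Count queries whose gold passage only appears beyond rank 50 in the BM25 run.
from collections import defaultdict

gold = defaultdict(set)
with open("qrels.dev.small.tsv") as f:          # assumed: "qid \t 0 \t pid \t 1"
    for line in f:
        qid, _, pid, _ = line.split()
        gold[qid].add(pid)

best_rank = {}  # best (lowest) BM25 rank of any gold passage per query
with open("run.dev.small.tsv") as f:            # assumed: "qid \t pid \t rank"
    for line in f:
        qid, pid, rank = line.split()
        if pid in gold.get(qid, ()):
            best_rank[qid] = min(int(rank), best_rank.get(qid, 10**9))

impossible = [q for q in gold if best_rank.get(q, 10**9) > 50]
print(f"{len(impossible)} / {len(gold)} queries have no gold passage in the BM25 top 50")
```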

I am not sure how these impossible examples are managed in your code (I am not yet familiar with it). Can you tell me more?

Moreover, I am wondering how to compare an MRR score computed with 50 candidates per query to a score computed with 1000 candidates per query (20 times more candidates seems like a lot). With the model I trained (XLM-R) I got an MRR of 0.46 on the dev.small set (I excluded impossible examples).

I think I am missing something big but can't find what. Can you help me?


NB: I am currently evaluating on the dev set with 1K candidates per data point, and the MRR is around 0.31 (I am at 1,000 data points right now). For impossible examples (those not containing the real answer), I keep them with a score of 0 (it's what they did here). I will report the final score when the evaluation is finished.

airKlizz commented 4 years ago

Hi,

Yes, you're right, I only re-rank the top 50 passages ranked by BM25. I didn't re-rank the 1000 passages because it would take too much time for me. That's why only the top 50 passages are in passages.bm25.small.json. During the evaluation I reuse the official eval script, msmarco_eval.py (you can find it in the official repo, I guess). In this script, if the right candidate is not in the top 10 (MaxMRRRank = 10) candidates, then the MRR added for this query is 0.
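
In other words, the rule boils down to something like this (a sketch of MRR@10, not the official script itself):

```python
# MRR@10: if the gold passage is not among the first 10 re-ranked candidates,
# the query contributes 0. This also covers "impossible" queries whose gold
# passage is missing from the candidate set entirely.
def mrr_at_10(ranked_candidates_per_query, gold_per_query):
    """ranked_candidates_per_query: {qid: [pid, ...]} in re-ranked order.
    gold_per_query: {qid: set of relevant pids}."""
    total = 0.0
    for qid, gold in gold_per_query.items():
        candidates = ranked_candidates_per_query.get(qid, [])
        rr = 0.0
        for rank, pid in enumerate(candidates[:10], start=1):
            if pid in gold:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(gold_per_query)

# Example: gold passage ranked 3rd for q1, missing for q2 -> (1/3 + 0) / 2
print(mrr_at_10({"q1": ["p9", "p4", "p7"], "q2": ["p1", "p2"]},
                {"q1": {"p7"}, "q2": {"p99"}}))
```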

I think that excluding impossible examples is not totally fair, because they might also be the more difficult queries (not sure at all!). That maybe explains the MRR score of 0.46, which seems too good to be true ;)

If you want to score with 1000 candidates, you can see this repo. If you follow all the steps you will have a collections/ folder with all the passages. Then I have a script to create passages.bm25.small.json with whatever BM25 top N you want. Tell me if you are interested.
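
Not the actual script, but the idea is roughly this (the collection and run formats, and the JSON layout of passages.bm25.small.json shown here, are assumptions; the real file may be structured differently):

```python
# Rough sketch: keep the top-N BM25 candidates per query and attach the passage text.
import json
from collections import defaultdict

TOP_N = 50

passages = {}
with open("collections/collection.tsv") as f:   # assumed: "pid \t passage text"
    for line in f:
        pid, text = line.rstrip("\n").split("\t", 1)
        passages[pid] = text

candidates = defaultdict(list)
with open("run.dev.small.tsv") as f:            # assumed: "qid \t pid \t rank"
    for line in f:
        qid, pid, rank = line.split()
        if int(rank) <= TOP_N:
            candidates[qid].append({"pid": pid, "rank": int(rank),
                                    "passage": passages.get(pid, "")})

with open("passages.bm25.small.json", "w") as f:
    json.dump(candidates, f)
```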

Hope my answer helps. Best,

pommedeterresautee commented 4 years ago

Thank you @airKlizz for your explanation. I agree that impossible examples have to be kept and that they should be scored with an MRR of 0. BTW, it's the very purpose of the "full ranking" setup in the MS MARCO leaderboard (as opposed to simple "re-ranking"). When I keep impossible examples in the top-50 setup, I get an MRR around 0.38, which is still much higher than what I was expecting.

My understanding of the top-50 dataset is that most of the time (>4K out of 6.7K examples) the right candidate is among the top 50. On these 4K possible examples, the model has far fewer candidates to choose from, and for each of them the score is much higher than what the model would get with 1000 candidates; I think that explains why the MRR is so high.

Since my last message, I have trained a few models with different hyperparameters on 100K examples and evaluated them on the original 1000-candidate dev set (top1000.dev). The MRR scores are all between 0.29 and 0.31, which matches the expectations from the paper.

This makes me think that the top-50 setup greatly reduces the difficulty of the task, so I don't think top-50 MRR scores should be compared to the top-1000 MRR scores published in the original paper.