Closed yixuan-qiao closed 3 years ago
I just found I used the wrong qrels file. After changing all "1" judgments (Related) to 0 (not relevant), I can reproduce the R@1000 metric.
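For reference, the grade remapping described above can be sketched in a few lines (this helper and its name are my own, purely illustrative — not pyserini or trec_eval code):

```python
# Illustrative sketch: collapse TREC-DL "Related" judgments (grade 1) to 0
# so that binary metrics such as MAP and R@1000 treat them as non-relevant,
# mirroring what trec_eval's -l 2 flag does at evaluation time.
# Each qrels line uses the standard format: "qid iteration docid grade".
def binarize_qrels(qrel_lines, min_rel=2):
    out = []
    for line in qrel_lines:
        qid, it, docid, grade = line.split()
        new_grade = int(grade) if int(grade) >= min_rel else 0
        out.append(f"{qid} {it} {docid} {new_grade}")
    return out
```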
Glad you got this sorted out!
After changing all "1" judgments (Related) to 0 (not relevant), I got R@1000 = 0.84432423 compared to your 0.8452. At first I thought I had reproduced your results, but after re-ranking with the monoT5-3B model, I got NDCG@10 = 0.7101 compared to your 0.7837. Since I'm not running duoT5-3B, maybe 0.76 or 0.77 is the expected result? I set BM25 with k1=0.82, b=0.68 as the paper says, so I'm not sure what I might have missed. I do think even the smallest difference in R@1000 could have a big effect on the ranking performance.
The following is the code we used:
from pyserini.search import SimpleSearcher

searcher = SimpleSearcher.from_prebuilt_index('msmarco-passage-expanded')
searcher.set_bm25(k1=float(0.82), b=float(0.68))
...
hits = searcher.search(Query(q_text_f).text, k=1000)  # do this for every query
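Since the search runs per query, a small helper like the one below can turn the hits into the six-column TREC run format that trec_eval and a downstream reranker consume. This is a hypothetical sketch (not part of pyserini); to stay library-independent it takes plain (docid, score) pairs rather than pyserini hit objects:

```python
# Hypothetical helper: emit retrieval results in the six-column TREC run
# format ("qid Q0 docid rank score tag"). `hits_by_qid` maps each query id
# to a list of (docid, score) pairs; docs are ranked by descending score.
def format_trec_run(hits_by_qid, run_tag="bm25"):
    lines = []
    for qid, hits in hits_by_qid.items():
        ranked = sorted(hits, key=lambda h: h[1], reverse=True)
        for rank, (docid, score) in enumerate(ranked, start=1):
            lines.append(f"{qid} Q0 {docid} {rank} {score:.4f} {run_tag}")
    return lines
```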
See this detail about using the qrels: https://github.com/castorini/anserini/blob/master/docs/regressions-dl19-passage.md#effectiveness
The relevance grades are treated differently for nDCG and MAP.
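To illustrate the difference: nDCG consumes the graded judgments directly, while binary metrics like MAP first threshold them. Below is a toy nDCG@k with linear gain — one common variant, written from scratch for illustration; trec_eval's exact implementation may differ in details:

```python
import math

# Toy nDCG@k with linear gain. The graded judgments feed straight into the
# gain, so no relevance threshold (like trec_eval's -l 2) is applied.
def ndcg_at_k(ranked_grades, all_grades, k=10):
    def dcg(grades):
        # rank r is 0-based here, so the discount is log2(r + 2)
        return sum(g / math.log2(r + 2) for r, g in enumerate(grades))
    ideal = dcg(sorted(all_grades, reverse=True)[:k])
    return dcg(ranked_grades[:k]) / ideal if ideal > 0 else 0.0
```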
For the qrels, i use the official TREC files from https://trec.nist.gov/data/deep2020.html, same as yours.
For trec_eval on passage, you need to use -l 2 for MAP, but not for nDCG, per the documentation above.
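The effect of -l 2 on a binary metric can be mimicked directly. A purely illustrative recall@k with a minimum relevance level (not trec_eval's actual code):

```python
# Illustrative recall@k with a minimum relevance level: grades below
# `min_rel` count as non-relevant, which is what trec_eval's -l 2 flag
# effectively enforces for binary metrics on the TREC-DL passage qrels.
def recall_at_k(ranked_docids, qrels, k=1000, min_rel=2):
    relevant = {d for d, g in qrels.items() if g >= min_rel}
    if not relevant:
        return 0.0
    return len(set(ranked_docids[:k]) & relevant) / len(relevant)
```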
As you said, I tested all metrics except nDCG with qrels transformed the way -l 2 would treat the three relevance levels, and used the official four-level qrels to test nDCG, but got the same results as above.
It is usually an eval issue if one metric matches but others are different...
I'll let the rest of the team chime in, but heads up - everyone is likely busy with EMNLP deadlines.
I'm sorry, maybe I didn't express myself correctly; no metrics are matching now:
for R@1000, I got 0.8443 vs your 0.8452
for NDCG@10, I got 0.7101 vs your 0.7837
Good luck with EMNLP, thanks a lot!!!
for R@1000, I got 0.8443 vs your 0.8452
This is within the margin of noise for neural techniques... I would consider these the same for the purposes of reproduction.
But BM25 is not a neural technique; it is a deterministic algorithm. In my opinion, even the smallest difference in R@1000 could produce different relevant passage lists. For comparison, on the document ranking task I used the default BM25 configuration and reproduced your results with the same R@1000 = 0.8403 (the last four digits match exactly); for NDCG@10 we got 0.688 without duoT5 compared to your 0.6934, which is acceptable.
I'm confused. For TREC 20 DL track, passage, we have:
What exactly isn't matching then?
If it's specifically monoT5-3B - I believe all our experiments are using TPUs... what's the setup on your end?
I was just comparing against your latest paper, The Expando-Mono-Duo Design Pattern for Text Ranking with Pretrained Sequence-to-Sequence Models (https://arxiv.org/abs/2101.05667).
I reproduced the same R@1000 = 0.8443 as your second doc, https://github.com/castorini/anserini/blob/master/docs/regressions-dl20-passage-docTTTTTquery.md, but in this paper you report a different R@1000 of 0.8452. We re-ranked that retrieved file with monoT5-3B and got NDCG@10 = 0.7101 vs 0.7837 in your paper. After carefully reading through the paper, I found no other tricks.
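For context on the reranking step: monoT5 scores each query-passage pair by the probability of the "true" token versus the "false" token in its output. The helper below only sketches that final softmax over two logits; the logits themselves would come from an actual T5-3B forward pass, which is omitted here:

```python
import math

# Sketch of the monoT5 scoring rule: the relevance score is the softmax
# probability of the "true" token against the "false" token. The two logit
# arguments are stand-ins for a real model forward pass.
def mono_score(true_logit, false_logit):
    m = max(true_logit, false_logit)  # subtract max for numerical stability
    e_true = math.exp(true_logit - m)
    e_false = math.exp(false_logit - m)
    return e_true / (e_true + e_false)
```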
maybe 0.76 or 0.77 is the expected result?
From https://github.com/castorini/anserini/blob/master/docs/regressions-dl20-passage-docTTTTTquery.md
It appears that BM25 w/ default parameters gives 0.8452 for R@1k, so that's probably what we used. In general, tuning on MS MARCO sparse judgments doesn't transfer over to TREC dense judgments - we learned this from TREC 2019, so for 2020 we must have just used the BM25 defaults.
So this resolves the 0.8443 vs 0.8452 issue.
First stage retrieval checks out, then.
I think your core issue is that you can't reproduce monoT5 and duoT5?
monoT5-3B, got NDCG@10 0.7101 vs 0.7837 in your paper.
Row (5) is monoT5 followed by duoT5 reranking, though. If you just ran monoT5-3B, naturally your results will be lower...
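For readers following along: duoT5 reranks the top candidates by comparing passages pairwise and aggregating the pairwise probabilities into per-document scores. A minimal sketch of one sum-style aggregation (the exact scheme used in the paper may differ; its ablations cover several variants):

```python
# Sketch of duoT5-style pairwise aggregation: pair_probs[(i, j)] is the
# model's probability that doc i is more relevant than doc j. Each doc's
# score sums the evidence from every ordered pair it appears in, and docs
# are reranked by descending aggregate score.
def duo_aggregate(docids, pair_probs):
    scores = {d: 0.0 for d in docids}
    for i in docids:
        for j in docids:
            if i == j:
                continue
            p = pair_probs.get((i, j), 0.5)  # 0.5 = no preference if missing
            scores[i] += p
            scores[j] += 1.0 - p
    return sorted(docids, key=lambda d: scores[d], reverse=True)
```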
For the document ranking task, we got NDCG@10 = 0.688 without duoT5-3B compared to your 0.6934, so I'd guess the improvement from duoT5-3B is about 1 point. But as you said, maybe for the passage ranking task the improvement is more significant (~7 points)?! If so, that is amazing! If possible, would you mind open-sourcing the duoT5-3B model so we can try it?
For the document ranking task, we got NDCG@10 = 0.688 without duoT5-3B compared to your 0.6934, so I'd guess the improvement from duoT5-3B is about 1 point.
So yes, I would consider this a successful monoT5 reproduction... you're within half a point.
The gains of duo over mono on passage are much larger because (1) our doc runs were zero-shot, and (2) doc length issues.
As for exact duo gains, we're working on an updated version of the paper right now with ablations.
So the tl;dr seems to be duoT5. I'll work with the team on getting it posted on Hugging Face.
I am looking forward to it. Thanks a lot for the patient responses. Hope everything goes well with EMNLP :-)
For the TREC passage full-ranking task, I used the prebuilt index msmarco-passage-expanded, set_bm25 with k1=0.82, b=0.68, no RM3; the top-1000 file re-ranked by monoT5-3B finally got R@1000 = 0.7853, NDCG@10 = 0.7101. The R@1000 is far below the paper's (0.8452), even lower than the traditional BM25 baseline of 0.7863.
I also tried rebuilding the index following the instructions at https://github.com/castorini/docTTTTTquery, but got the same result. I'm not sure if I made a mistake or if there is something else I didn't consider.