castorini / pygaggle

a gaggle of deep neural architectures for text ranking and question answering, designed for Pyserini
http://pygaggle.ai/
Apache License 2.0

TREC passage ranking result reproduce #181

Closed · yixuan-qiao closed 3 years ago

yixuan-qiao commented 3 years ago

For the TREC passage full-ranking task, I used the prebuilt index msmarco-passage-expanded and set_bm25 with k1=float(0.82), b=float(0.68), no RM3. The top-1000 file, reranked by monoT5-3B, finally got R@1000 = 0.7853 and NDCG@10 = 0.7101. The R@1000 is far below the paper's (0.8452), and even lower than the traditional BM25 baseline of 0.7863.

I also tried rebuilding the index following the instructions at https://github.com/castorini/docTTTTTquery, but got the same result. I'm not sure if I made a mistake or if there is something else I didn't consider.

yixuan-qiao commented 3 years ago

I just found that I used the wrong qrels file. After changing all "1" judgments (related) to 0 (not relevant), I can reproduce the R@1000 metric.
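
For concreteness, the fix described above amounts to a one-pass rewrite of the qrels file. This is only a sketch: the file names are placeholders, and trec_eval's -l flag, discussed later in this thread, achieves the same effect without editing the file.

```python
# Sketch of the qrels transformation: map grade-1 ("related") judgments
# to 0 so they count as non-relevant for the binary metrics.
# File names here are placeholders.
with open('qrels.dl20-passage.txt') as fin, \
        open('qrels.dl20-passage.binary.txt', 'w') as fout:
    for line in fin:
        topic, it, docid, grade = line.split()
        if grade == '1':
            grade = '0'
        fout.write(f'{topic} {it} {docid} {grade}\n')
```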

lintool commented 3 years ago

Glad you got this sorted out!

yixuan-qiao commented 3 years ago

After changing all "1" judgments (related) to 0 (not relevant), I got R@1000 of 0.84432423 compared to your 0.8452. At first I thought I had reproduced your results, but after reranking with the monoT5-3B model, I got NDCG@10 of 0.7101 compared to your 0.7837. Although without duoT5-3B, maybe 0.76 or 0.77 would be the expected result? I set BM25 with k1=float(0.82), b=float(0.68) as the paper says, so I am not sure whether there is something I might have missed. I do think even the smallest difference in R@1000 could have a big effect on the ranking performance.

The following is the code we used:

```python
from pyserini.search import SimpleSearcher

searcher = SimpleSearcher.from_prebuilt_index('msmarco-passage-expanded')
searcher.set_bm25(k1=float(0.82), b=float(0.68))
...
hits = searcher.search(Query(q_text_f).text, k=1000)  # do this for every query
```
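
For reference, the monoT5 reranking stage described above would look roughly like this in pygaggle, following the pattern in its README. This is a sketch: MonoT5() loads the base checkpoint by default, so the 3B weights discussed in this thread would have to be substituted, and q_text_f and hits are reused from the snippet above.

```python
# Sketch of the monoT5 reranking stage on top of the BM25 hits.
from pygaggle.rerank.base import Query, hits_to_texts
from pygaggle.rerank.transformer import MonoT5

reranker = MonoT5()  # base checkpoint by default; swap in monoT5-3B for the paper's setup

query = Query(q_text_f)                   # same query text as the retrieval step
texts = hits_to_texts(hits)               # wrap the BM25 top-1000 hits
reranked = reranker.rerank(query, texts)  # texts come back with monoT5 scores
reranked.sort(key=lambda t: t.score, reverse=True)
```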

lintool commented 3 years ago

See this detail about using the qrels: https://github.com/castorini/anserini/blob/master/docs/regressions-dl19-passage.md#effectiveness

The relevance grades are treated differently for nDCG and MAP.

yixuan-qiao commented 3 years ago

For the qrels, I used the official TREC files from https://trec.nist.gov/data/deep2020.html, same as yours.

lintool commented 3 years ago

For trec_eval, for passage, you need to use -l 2 for MAP, but not for nDCG, per the documentation above.
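
Concretely, the split described here would look something like the following, using Pyserini's bundled trec_eval wrapper (a sketch; the qrels and run file names are placeholders):

```python
# -l 2 binarizes the graded qrels (grade 1 counts as non-relevant) for
# MAP and R@1000; nDCG@10 uses the graded judgments as-is.
import subprocess

qrels = 'qrels.dl20-passage.txt'  # official TREC 2020 DL passage qrels
run = 'run.monot5-3b.trec'        # placeholder run file name

# MAP and R@1000: binarized with -l 2
subprocess.run(['python', '-m', 'pyserini.eval.trec_eval',
                '-c', '-l', '2', '-m', 'map', qrels, run], check=True)
subprocess.run(['python', '-m', 'pyserini.eval.trec_eval',
                '-c', '-l', '2', '-m', 'recall.1000', qrels, run], check=True)

# nDCG@10: graded judgments, no -l 2
subprocess.run(['python', '-m', 'pyserini.eval.trec_eval',
                '-c', '-m', 'ndcg_cut.10', qrels, run], check=True)
```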

yixuan-qiao commented 3 years ago

As you said, I used the qrels with three relevance grades (after transforming with -l 2) to compute all metrics except nDCG, and the official qrels with four grades to compute nDCG, but got the same results as above.

lintool commented 3 years ago

It is usually an eval issue if one metric matches but others are different...

I'll let the rest of the team chime in, but heads up - everyone is likely busy with EMNLP deadlines.

yixuan-qiao commented 3 years ago

I'm sorry, maybe I didn't express myself correctly; no metrics are matching now:

for R@1000, I got 0.8443 vs. your 0.8452
for NDCG@10, I got 0.7101 vs. your 0.7837

Good luck with EMNLP, thanks a lot!!!

lintool commented 3 years ago

for R@1000, I got 0.8443 vs. your 0.8452

This is within the margin of noise for neural techniques... I would consider these the same for the purposes of reproduction.

yixuan-qiao commented 3 years ago

But BM25 is not a neural technique; it is a deterministic algorithm. In my opinion, even the smallest difference in R@1000 could produce different relevant-passage lists. For comparison, on the document ranking task I used BM25 with the default config and reproduced your results with the same R@1000 of 0.8403 (the last four digits are exactly the same); for NDCG@10, we got 0.688 without duoT5 compared to your 0.6934, which is acceptable.

lintool commented 3 years ago

I'm confused. For the TREC 2020 DL track, passage task, we have: [results table]

What exactly isn't matching then?

If it's specifically monoT5-3B - I believe all our experiments are using TPUs... what's the setup on your end?

yixuan-qiao commented 3 years ago

I'm just comparing to your latest paper, "The Expando-Mono-Duo Design Pattern for Text Ranking with Pretrained Sequence-to-Sequence Models" (https://arxiv.org/abs/2101.05667). [screenshot of the paper's results table]

I reproduced the same R@1000 of 0.8443 as in your regression doc (https://github.com/castorini/anserini/blob/master/docs/regressions-dl20-passage-docTTTTTquery.md), but in this paper you report a different R@1000 of 0.8452. We reranked this retrieved file with monoT5-3B and got NDCG@10 of 0.7101 vs. 0.7837 in your paper. Having carefully read through your paper, I found no other tricks.

yixuan-qiao commented 3 years ago

Maybe 0.76 or 0.77 would be the expected result?

lintool commented 3 years ago

From https://github.com/castorini/anserini/blob/master/docs/regressions-dl20-passage-docTTTTTquery.md

[screenshot of the regression table showing R@1K scores]

It appears that BM25 w/ default parameters gives 0.8452 for R@1k, so that's probably what we used. In general, tuning on MS MARCO sparse judgments doesn't transfer over to TREC dense judgments - we learned this from TREC 2019, so for 2020 we must have just used the BM25 defaults.
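
In code, that first stage would simply drop the set_bm25() call, since Pyserini's out-of-the-box BM25 parameters are k1=0.9, b=0.4 (a sketch, reusing q_text_f from the earlier snippet):

```python
# First-stage retrieval with BM25 defaults (k1=0.9, b=0.4), which
# appears to be what produced R@1K = 0.8452 on the expanded index.
from pyserini.search import SimpleSearcher

searcher = SimpleSearcher.from_prebuilt_index('msmarco-passage-expanded')
hits = searcher.search(q_text_f, k=1000)  # no set_bm25(): defaults apply
```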

So this resolves the 0.8443 vs 0.8452 issue.

First stage retrieval checks out, then.

I think your core issue is that you can't reproduce monoT5 and duoT5?

we reranked this retrieved file with monoT5-3B and got NDCG@10 of 0.7101 vs. 0.7837 in your paper.

Row (5) is monoT5 followed by duoT5 reranking, though. If you just ran monoT5-3B, naturally your results will be lower...
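
For reference, the duo stage would sit on top of the monoT5 output roughly like this (a sketch: DuoT5() loads the base checkpoint, since the 3B duo weights were not yet public at the time of this thread, and the pool depth of 50 is an assumption based on the paper's mono-then-duo design):

```python
# Sketch of the duoT5 second reranking stage. Duo scoring is pairwise,
# so it is applied only to a shallow pool of monoT5's top candidates.
from pygaggle.rerank.transformer import DuoT5

duo = DuoT5()                        # base checkpoint; 3B duo not yet released
top_pool = reranked[:50]             # monoT5 output from the earlier sketch
final = duo.rerank(query, top_pool)  # aggregated pairwise scores
final.sort(key=lambda t: t.score, reverse=True)
```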

yixuan-qiao commented 3 years ago

For the document ranking task, we got NDCG@10 of 0.688 without duoT5-3B compared to your 0.6934, so I guess the improvement from duoT5-3B is about 1 point. But as you said, maybe for the passage ranking task the improvement is more significant (~7 points)?! If so, that is amazing! If possible, would you mind open-sourcing the duoT5-3B model so we can try it?

lintool commented 3 years ago

For the document ranking task, we got NDCG@10 of 0.688 without duoT5-3B compared to your 0.6934, so I guess the improvement from duoT5-3B is about 1 point.

So yes, I would consider this a successful monoT5 reproduction... you're within half a point.

The gains of duo over mono on passage are much larger because (1) our doc runs were zero-shot, and (2) there are doc length issues.

As for exact duo gains, we're working on an updated version of the paper right now w/ ablations.

So the tl;dr seems to be duoT5. I'll work with the team on getting it posted on Hugging Face.

yixuan-qiao commented 3 years ago

I am looking forward to it. Thanks a lot for your patient responses. I wish everything goes well with EMNLP :-)