AmenRa / retriv

A Python Search Engine for Humans 🥸

Compare retriv's performance to rank_bm25 and pyserini #28

Closed MarshtompCS closed 1 year ago

MarshtompCS commented 1 year ago

Hi! I see that retriv's speed is really impressive in speed.md. Did you also compare their retrieval performance?

AmenRa commented 1 year ago

Hi, performance should be roughly the same for pyserini and retriv. pyserini is built on top of lucene, and retriv's BM25 implementation is based on elasticsearch, which is also built on top of lucene. The only difference could be the BM25 hyper-parameter settings: retriv uses the same settings as elasticsearch out-of-the-box, while pyserini probably uses those of lucene. Text pre-processing could have some minor differences. In the end, you can make them behave the same, and they should both perform similarly out-of-the-box. I don't know about rank_bm25; I never looked at its source code.
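For context, here is a minimal, self-contained sketch of the BM25 scoring formula showing exactly where `k1` and `b` enter (an illustration only, not retriv's or pyserini's actual code; 1.2/0.75 are the elasticsearch/lucene defaults):

```python
import math

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len, k1=1.2, b=0.75):
    """Score one document against a query with BM25.

    k1 and b are the hyper-parameters discussed above; other toolkits may
    ship different defaults, which alone can shift the evaluation numbers.
    """
    score = 0.0
    doc_len = len(doc_terms)
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        df = doc_freqs.get(term, 0)
        # Lucene-style IDF
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        # k1 controls term-frequency saturation, b controls length normalization
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm
    return score
```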

MarshtompCS commented 1 year ago

> Hi, performance should be roughly the same for pyserini and retriv. pyserini is built on top of lucene, and retriv's BM25 implementation is based on elasticsearch, which is also built on top of lucene. The only difference could be the BM25 hyper-parameter settings: retriv uses the same settings as elasticsearch out-of-the-box, while pyserini probably uses those of lucene. Text pre-processing could have some minor differences. In the end, you can make them behave the same, and they should both perform similarly out-of-the-box. I don't know about rank_bm25; I never looked at its source code.

I think it is really necessary to compare their performance on standard datasets. pyserini's authors said there are many weak BM25 implementations, leading to poor performance: https://arxiv.org/pdf/2104.05740.pdf
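For instance, a quick sketch of what such a comparison could look like (using ranx; the file paths are placeholders, not actual files):

```python
# Hypothetical sketch: evaluate a TREC-format run produced by any BM25
# implementation against the official qrels. Paths are placeholders.
from ranx import Qrels, Run, evaluate

qrels = Qrels.from_file("msmarco-dev-qrels.trec", kind="trec")
run = Run.from_file("bm25-run.trec", kind="trec")

print(evaluate(qrels, run, ["mrr@10", "ndcg@10", "recall@1000"]))
```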

AmenRa commented 1 year ago

The main problem with BM25 baselines is that most people do not optimize its hyper-parameters when performing comparisons. That's one of the main motivations why retriv has a feature that allows you to do that very easily.
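To make "optimizing the hyper-parameters" concrete, this is the kind of grid search involved, written library-agnostically (`build_run` and `score` below are hypothetical hooks standing in for "search with these parameters" and "evaluate against qrels"; this is not retriv's API):

```python
# Sketch of a BM25 hyper-parameter grid search on a dev set.
import itertools

def tune_bm25(build_run, score, k1_grid=(0.6, 0.9, 1.2, 1.5), b_grid=(0.3, 0.5, 0.75, 1.0)):
    best = None
    for k1, b in itertools.product(k1_grid, b_grid):
        metric = score(build_run(k1=k1, b=b))  # e.g. MRR@10 on the dev queries
        if best is None or metric > best[0]:
            best = (metric, k1, b)
    return best  # (best_metric, best_k1, best_b)
```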

Regarding performance, as of now, retriv and Pyserini out-of-the-box perform as follows:

| Dataset | Metric | retriv | Pyserini |
| --- | --- | --- | --- |
| MSMARCO Dev | MRR@10 | 0.185 | 0.184 |
| MSMARCO Dev | Recall | 0.873 | 0.853 |
| TREC DL 2019 | NDCG@10 | 0.479 | 0.506 |
| TREC DL 2019 | Recall | 0.753 | 0.750 |
| TREC DL 2020 | NDCG@10 | 0.496 | 0.480 |
| TREC DL 2020 | Recall | 0.811 | 0.786 |

The differences you see are mainly due to the different default BM25 hyper-parameter settings of the two libraries and to slightly different text pre-processing pipelines.
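As a hedged sketch, this is how one could align the settings on the pyserini side before re-running the comparison (the import path and prebuilt index name may differ across pyserini versions):

```python
# Sketch: set Pyserini's BM25 hyper-parameters explicitly so both libraries
# score with the same k1/b. The index name is an example prebuilt index.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index("msmarco-v1-passage")
searcher.set_bm25(k1=1.2, b=0.75)  # e.g. match the elasticsearch/retriv defaults
hits = searcher.search("what is the definition of bm25", k=10)
for hit in hits:
    print(hit.docid, hit.score)
```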

MarshtompCS commented 1 year ago

That's great! Thanks for reporting this!