Closed: MarshtompCS closed this issue 1 year ago
Hi, performance should be roughly the same for pyserini and retriv. pyserini is built on top of lucene, and retriv's BM25 implementation is based on elasticsearch, which is itself built on top of lucene. The only difference could be the BM25 hyper-parameter settings: retriv uses the same settings as elasticsearch out-of-the-box, while pyserini probably uses those of lucene. Text pre-processing could also have some minor differences. In the end, you can make them behave the same, and they should both perform similarly out-of-the-box.
I dunno about rank_bm25. I never looked at its source code.
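To make the hyper-parameter point concrete: BM25's behavior is governed mainly by k1 (term-frequency saturation) and b (document-length normalization), and different libraries ship different defaults (Elasticsearch/Lucene default to k1=1.2, b=0.75, while Pyserini's default is reportedly k1=0.9, b=0.4). Here is a minimal, illustrative pure-Python sketch of the classic BM25 scoring formula showing where those two parameters enter; real engines use optimized inverted indexes instead:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query with classic BM25.

    k1 controls how quickly repeated term occurrences saturate;
    b controls how strongly long documents are penalized.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue  # term never occurs: contributes nothing
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + len(doc_terms) / avgdl))
    return score
```

With the same collection and tokenization, two libraries that only disagree on (k1, b) will produce different rankings from this same formula, which is exactly the kind of out-of-the-box gap described above.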
I think it is really necessary to compare the performance on standard datasets. pyserini's authors noted that there are many weak BM25 implementations, leading to poor performance: https://arxiv.org/pdf/2104.05740.pdf
The main problem with BM25 baselines is that most people do not optimize its hyper-parameters when performing comparisons. That's one of the main motivations behind retriv: it has a feature that lets you do exactly that very easily.
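The kind of tuning retriv automates can be sketched as a plain grid search over (k1, b) that maximizes a validation metric such as MRR. This is an illustrative, self-contained toy (the data, the bm25 helper, and the mrr helper are all made up for the example; this is not retriv's actual API):

```python
import math
from collections import Counter

def bm25(query, doc, corpus, k1, b):
    # Classic BM25 with tunable k1 (tf saturation) and b (length normalization).
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for t in query:
        df = sum(1 for d in corpus if t in d)
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + len(doc) / avgdl))
    return score

def mrr(queries, qrels, corpus, k1, b):
    # Mean reciprocal rank of the relevant document across all queries.
    total = 0.0
    for q, rel in zip(queries, qrels):
        ranked = sorted(range(len(corpus)),
                        key=lambda i: -bm25(q, corpus[i], corpus, k1, b))
        total += 1 / (ranked.index(rel) + 1)
    return total / len(queries)

# Toy validation set: each query has one known relevant document index.
corpus = [["cat", "food", "bowl"], ["dog", "leash", "walk"], ["cat", "cat", "toy"]]
queries = [["cat"], ["dog"]]
qrels = [2, 1]

# Pick the (k1, b) pair with the best validation MRR.
grid = [(k1, b) for k1 in (0.9, 1.2, 1.5) for b in (0.4, 0.75, 1.0)]
best = max(grid, key=lambda p: mrr(queries, qrels, corpus, *p))
```

On real collections you would run the same loop against held-out queries and qrels; the point is only that the search itself is cheap compared to re-indexing, since (k1, b) affect scoring, not the index.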
Regarding performances, as of now, retriv
out-of-the-box performs as follows:
MSMARCO Dev MRR@10: 0.185 Recall: 0.873
TREC DL 2019 NDCG@10: 0.479 Recall: 0.753
TREC DL 2020 NDCG@10: 0.496 Recall: 0.811
Pyserini
out-of-the-box performs as follows:
MSMARCO Dev MRR@10: 0.184 Recall: 0.853
TREC DL 2019 NDCG@10: 0.506 Recall: 0.750
TREC DL 2020 NDCG@10: 0.480 Recall: 0.786
The differences you see are mainly due to the different default BM25 hyper-parameter settings of the two libraries and to slightly different text pre-processing pipelines.
That's great! Thanks for reporting this!
Hi! I see that retriv's speed is really impressive in speed.md. Did you also compare their performance?