Closed: alexlimh closed this issue 3 years ago
Hi Minghan, we don't provide the BM25 index and code. We used Lucene and Java, and including them would add Java and Lucene installation requirements to our project, raising the barrier to use for the community. Anyway, as you can see from the final results, the hybrid approach is not generally better than the dense-only scheme.
Hi Vladimir,
Thanks for your reply. I see your point, but I still think it's necessary for reimplementation. Would it be acceptable if I used, for example, Elasticsearch for BM25 and submitted it as a pull request? It would be something like this: https://huggingface.co/docs/datasets/_modules/nlp/search.html
Best, Minghan
For the purpose of bringing a BM25 implementation to the repo, I'd use the Anserini framework instead of Elasticsearch: https://github.com/castorini/pyserini
Ah, you are right, Pyserini is indeed better. Thanks for the information.
Minghan
I guess we can close this issue.
Hi,
I'm wondering whether there is code for DPR + BM25 as described in your paper:
"In addition to DPR, we also present the results of BM25, the traditional retrieval method, and BM25+DPR, using a linear combination of their scores as the new ranking function. Specifically, we obtain two initial sets of top-2000 passages based on BM25 and DPR, respectively, and rerank the union of them using BM25(q,p) + λ · sim(q, p) as the ranking function. We used λ = 1.1 based on the retrieval accuracy in the development set."
Thanks, Minghan
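
For anyone reimplementing the quoted reranking step, here is a minimal sketch, not the authors' code. It assumes you already have two dictionaries mapping passage ids to scores for a single query, one from BM25 and one from DPR (for example, the top-2000 passages from each). The function name `hybrid_rerank`, its parameters, and the handling of passages that appear in only one list are all assumptions; the paper excerpt does not say how missing scores are filled in.

```python
from typing import Dict, List, Tuple


def hybrid_rerank(
    bm25_scores: Dict[str, float],
    dense_scores: Dict[str, float],
    lam: float = 1.1,
    top_k: int = 100,
) -> List[Tuple[str, float]]:
    """Rerank the union of two candidate sets by BM25(q,p) + lam * sim(q,p)."""
    # Assumption (not stated in the paper excerpt): a passage missing from
    # one list gets that list's lowest observed score, or 0.0 if the list
    # is empty.
    bm25_floor = min(bm25_scores.values(), default=0.0)
    dense_floor = min(dense_scores.values(), default=0.0)

    combined: Dict[str, float] = {}
    for pid in set(bm25_scores) | set(dense_scores):
        bm25 = bm25_scores.get(pid, bm25_floor)
        sim = dense_scores.get(pid, dense_floor)
        combined[pid] = bm25 + lam * sim

    # Sort by combined score, highest first; ties are broken by passage id
    # so the output is deterministic.
    ranked = sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


if __name__ == "__main__":
    # Toy scores for one query; the values are made up for illustration.
    bm25 = {"p1": 12.0, "p2": 9.5, "p3": 7.0}
    dense = {"p2": 80.0, "p3": 82.0, "p4": 79.0}
    for pid, score in hybrid_rerank(bm25, dense, lam=1.1, top_k=3):
        print(pid, round(score, 2))
```

The BM25 scores could come from Pyserini or any other BM25 implementation. Because BM25 and dense similarity scores live on different scales, the value of λ would likely need to be retuned on a development set if either retriever changes.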