Storia-AI / sage

Chat with any codebase in under two minutes | Fully local or via third-party APIs
https://sage.storia.ai
Apache License 2.0

Feature request: Implement reciprocal rank fusion for combining multiple retriever outputs #55

Open mihail911 opened 1 month ago

mihail911 commented 1 month ago

Right now, when we combine the outputs of a BM25 encoder and a dense retriever, we simply take a weighted average of their scores. It's more standard to use reciprocal rank fusion, which merges the retrievers' rankings rather than their raw scores.

We should implement this alternative hybrid scoring method.
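
For reference, here is a minimal sketch of what reciprocal rank fusion computes, assuming each retriever returns a best-first list of document IDs. The function name, the file names, and the constant `k=60` (a commonly used default) are illustrative assumptions, not code from this repository.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first lists of document IDs into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1. Raw retriever scores are never used.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings from a BM25 retriever and a dense retriever.
bm25_ranking = ["a.py", "b.py", "c.py"]
dense_ranking = ["b.py", "d.py", "a.py"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Because only ranks enter the formula, the two retrievers do not need to produce scores on comparable scales.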

iuliaturc commented 1 month ago

This is actually quite trivial via LangChain's EnsembleRetriever: https://python.langchain.com/docs/how_to/ensemble_retriever/

iuliaturc commented 1 month ago

Also, I just wanted to confirm that the hybrid retriever we are using right now (PineconeHybridSearchRetriever) does indeed use a simple weighting of the two scores, according to Pinecone's documentation. (I second-guessed myself because I had never read it explicitly and had just assumed that was the case.)

So the easiest way forward is to use LangChain's EnsembleRetriever if we want reciprocal rank fusion.
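
For contrast, the score weighting described above can be sketched roughly as follows. This is an illustration under assumptions, not Pinecone's or this repository's code: the function name `weighted_score_fusion`, the `alpha` parameter, and all scores are made up for the example.

```python
def weighted_score_fusion(dense_scores, sparse_scores, alpha=0.5):
    """Combine per-document scores as alpha * dense + (1 - alpha) * sparse.

    Documents missing from one retriever get a score of 0.0 from it.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    doc_ids = set(dense_scores) | set(sparse_scores)
    combined = {
        d: alpha * dense_scores.get(d, 0.0)
        + (1.0 - alpha) * sparse_scores.get(d, 0.0)
        for d in doc_ids
    }
    # Highest combined score first; ties are broken by document ID.
    return sorted(combined, key=lambda d: (-combined[d], d))


# Hypothetical scores: dense similarities are bounded, BM25 scores are not.
dense_scores = {"a.py": 0.82, "b.py": 0.80, "c.py": 0.10}
sparse_scores = {"a.py": 1.0, "b.py": 2.0, "c.py": 12.0}

ranked = weighted_score_fusion(dense_scores, sparse_scores, alpha=0.5)
print(ranked)
```

With these made-up numbers the unbounded sparse scores dominate the result even at `alpha=0.5`, which is the kind of scale sensitivity that rank-based fusion avoids.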

aarya-16 commented 1 month ago

Hello @iuliaturc, I've seen what needs to be done and would like to work on this. Please assign this to me.

iuliaturc commented 1 month ago

All yours @aarya-16 :)

aarya-16 commented 1 month ago

Hello @iuliaturc, I have opened PR #87 for this issue. Let me know if it checks out, and also whether you want me to open a separate pull request to add the unit tests for this file. (A separate PR would be nice since it is Hacktoberfest 😄)