creating an alternative in-memory bm25 store and retriever

Guest400123064 commented 2 months ago

Dear maintainers, I am a newcomer to the Haystack project and have been enjoying the framework thus far! However, when I went to the source code of the in-memory document store, if I understood it correctly, the bm25 retriever implementation is suboptimal as it recreates an inverse index on every new search. Therefore, I tried to implement an alternative solution in this repo following the custom document store template. I am not sure if this should be an integration (because it is not actually related to any other technologies), and this is my first time trying to contribute to an open-source project, so I am not sure how I should move forward. I have not yet published a package to PyPI (but it is installable from the GitHub repo). Moreover, I have some different thoughts on the filters as well. So, to sum up, any suggestion would be greatly appreciated!

julian-risch commented 2 months ago

Dear @Guest400123064 thank you for reaching out! Great repo, we saw your LinkedIn post. 👏 🙂 You're right that the current bm25 implementation in Haystack is suboptimal. It really recreates the index on every search. It would be great if you could open a PR in the Haystack repository and contribute improvements of that specific part of the implementation of the InMemoryDocumentStore. Other improvements might be interesting too but we should briefly discuss them before you put in efforts to open another PR. For example, changing the filtering logic would be a breaking change that we might not want to merge into Haystack.

We have written down contribution guidelines here to make the start as easy as possible: https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md You can also open a Draft PR early on while you are working on the PR so that you can get some early feedback from our team.

I suggest that you read the guidelines and then open a PR in the Haystack repository (not in haystack-integrations) that improves the InMemoryDocumentStore in Haystack so that it does not recreate the reverse index for BM25 for every search. That PR should be relatively small but would have great impact. Does that sound like a good start to contributing to an open source project? 🙂

Guest400123064 commented 2 months ago

Thanks for the reply! I can start working on that! If I am understanding it correctly, I will only change the indexing logic and leaving the filtering and tokenization method unchanged? Another question would be, do we want to keep using rank_bm25 as an backend? The reason for asking is that the indexing logic is built within the package, which I have no control over.

julian-risch commented 2 months ago

@Guest400123064 It would suggest to leave out the changes of the filtering and tokenization logic for the first PR, yes. Smaller changes make it easier to review and merge. The filtering we probably don't want to change. The tokenization we can discuss. If you find that the implementation works better without rank_bm25, I don't see a reason to keep it. By the way, you probably saw that Haystack uses the rank_bm25 from here: https://github.com/deepset-ai/haystack-bm25

Guest400123064 commented 2 months ago

Got it! I will do some initial work to see how things go

Guest400123064 commented 2 months ago

Closing this thread after merge

deepset-ai / haystack-integrations

creating an alternative in-memory bm25 store and retriever #218