Open maxjakob opened 2 months ago
@anakin87 @silvanocerza Would be great to get your input here.
I don't see why not, to be fair. I'm not against this at all. Everything you wrote makes total sense in my opinion.
Are you going to handle the implementation of this? 👀
Thank you for your interest!
I agree that breaking changes should be avoided. We can attempt to integrate this into the existing document store. If it proves too hard without breakage we can add a new class (and deprecate the old one). What do you think?
Regarding naming, here are some proposals (I'm completely open to other names):
- `ElasticsearchBM25Retriever`
- `ElasticsearchDenseEmbeddingRetriever` with a `hybrid` option. Alternatively we can add an `ElasticsearchHybridRetriever`.
- `ElasticsearchDenseExactEmbeddingRetriever` (not convinced we need it but it is more efficient for <10k documents)
- `ElasticsearchSparseEmbeddingRetriever`
I'm going to work on the LangChain integration. It will become the reference implementation for this kind of integration with the package mentioned above. It would be fantastic if somebody from the community wants to give it a shot and integrate this into Haystack. That somebody would be invited to write a blog post for Elastic Search Labs to get some exposure for them and their Haystack use case in order to make a bit of a marketing noise, if they want to do this kind of thing.
The mentioned LangChain reference implementation can be found here: https://github.com/langchain-ai/langchain-elastic/blob/66cf6f110dbfb2a89a1f92fbaa6488022275e17d/libs/elasticsearch/langchain_elasticsearch/vectorstores.py#L553
Summary and motivation
Elasticsearch offers multiple retrieval features including
Other libraries such as LangChain already have all these options integrated. It would be great to also have them available in Haystack. Elastic is currently working on a Python package that will make the integration of these features easier. Here we want to discuss how to best make them available.
Questions
Detailed design
Concrete proposal:
- `ElasticsearchDocumentStore` takes an argument `retrieval_strategy`, similarly to how it is done in LangChain. Calls to `write_documents` make use of the retrieval strategy to know how to index the data.
- Retrievers (`ElasticsearchDenseVectorRetriever`, `ElasticsearchSparseVectorRetriever`, `ElasticsearchHybridRetriever`, ...) that get initialized with an `ElasticsearchDocumentStore`. The retrieval strategy has to match the expectation of the individual retrievers. We check that the expectation is met upon initialization. For retrieving documents, the retrievers call a search method on the document store as this is the established pattern.
Checklist
If the request is accepted, ensure the following checklist is complete before closing this issue.
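To make the concrete proposal under "Detailed design" easier to discuss, here is a minimal sketch of the pattern it describes: the store is configured with a retrieval strategy, `write_documents` consults that strategy when indexing, and a retriever checks at initialization that the store's strategy matches what it expects. Everything in it is an assumption for illustration only. The names `RetrievalStrategy`, `SketchDocumentStore` and `SketchDenseVectorRetriever` are hypothetical, documents are plain dicts, the in-memory list stands in for an Elasticsearch index, and none of this is the actual Haystack or Elastic API.

```python
from enum import Enum
from typing import Dict, List


class RetrievalStrategy(Enum):
    """Hypothetical strategy names, not taken from any real package."""
    BM25 = "bm25"
    DENSE_VECTOR = "dense_vector"


class SketchDocumentStore:
    """In-memory stand-in for a document store configured with a strategy."""

    def __init__(self, retrieval_strategy: RetrievalStrategy):
        self.retrieval_strategy = retrieval_strategy
        self._docs: List[Dict] = []

    def write_documents(self, documents: List[Dict]) -> int:
        # The strategy decides which fields are indexed.
        for doc in documents:
            entry = {"content": doc["content"]}
            if self.retrieval_strategy is RetrievalStrategy.DENSE_VECTOR:
                if "embedding" not in doc:
                    raise ValueError("dense strategy requires an embedding")
                entry["embedding"] = doc["embedding"]
            self._docs.append(entry)
        return len(documents)

    def search(self, query_embedding: List[float], top_k: int = 10) -> List[Dict]:
        # Only the dense case is sketched: rank by dot product, highest first.
        if self.retrieval_strategy is not RetrievalStrategy.DENSE_VECTOR:
            raise NotImplementedError("only dense search is sketched here")

        def score(doc: Dict) -> float:
            return sum(q * d for q, d in zip(query_embedding, doc["embedding"]))

        return sorted(self._docs, key=score, reverse=True)[:top_k]


class SketchDenseVectorRetriever:
    """Retriever that validates the store's strategy when it is created."""

    expected_strategy = RetrievalStrategy.DENSE_VECTOR

    def __init__(self, document_store: SketchDocumentStore):
        if document_store.retrieval_strategy is not self.expected_strategy:
            raise ValueError(
                f"expected {self.expected_strategy}, "
                f"got {document_store.retrieval_strategy}"
            )
        self._store = document_store

    def run(self, query_embedding: List[float], top_k: int = 10) -> List[Dict]:
        # Retrieval is delegated to the store's search method.
        return self._store.search(query_embedding, top_k=top_k)
```

With this shape, pairing a dense retriever with a store created for `RetrievalStrategy.BM25` fails at construction time rather than at query time, which is the initialization check the proposal asks for.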