deepset-ai / haystack-core-integrations

Additional packages (components, document stores and the like) to extend the capabilities of Haystack version 2.0 and onwards
https://haystack.deepset.ai
Apache License 2.0

Elasticsearch: support dense, sparse, hybrid with inference in Elasticsearch #699

Open maxjakob opened 2 months ago

maxjakob commented 2 months ago

Summary and motivation

Elasticsearch offers multiple retrieval features including

Other libraries such as LangChain already have all these options integrated. It would be great to also have them available in Haystack. Elastic is currently working on a Python package that will make the integration of these features easier. Here we want to discuss how to best make them available.

Questions

Detailed design

Concrete proposal:

  1. ElasticsearchDocumentStore takes an argument retrieval_strategy, similar to how it is done in LangChain. Calls to write_documents use the retrieval strategy to determine how to index the data.
  2. We add a number of different retrievers (ElasticsearchDenseVectorRetriever, ElasticsearchSparseVectorRetriever, ElasticsearchHybridRetriever, ...) that get initialized with an ElasticsearchDocumentStore. The retrieval strategy has to match the expectation of the individual retrievers. We check that the expectation is met upon initialization. For retrieving documents, the retrievers call a search method on the document store as this is the established pattern.
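
To make the two points above concrete, here is a minimal, self-contained sketch of how the pieces could fit together. It does not use Haystack or the Elasticsearch client. The names `ElasticsearchDocumentStore`, `ElasticsearchDenseVectorRetriever`, `retrieval_strategy`, `write_documents` and `search` come from the proposal; the `RetrievalStrategy` enum, the `expected_strategy` attribute, the in-memory storage and the `run` signature are assumptions made purely for illustration.

```python
from enum import Enum


class RetrievalStrategy(Enum):
    """Hypothetical enum of strategies; not an existing Haystack or Elastic API."""

    DENSE_VECTOR = "dense_vector"
    SPARSE_VECTOR = "sparse_vector"
    HYBRID = "hybrid"


class ElasticsearchDocumentStore:
    """In-memory stand-in for the document store described in point 1."""

    def __init__(self, retrieval_strategy):
        self.retrieval_strategy = retrieval_strategy
        self._docs = []

    def write_documents(self, documents):
        # The strategy decides how each document is indexed. Here we only
        # record it; a real store would pick mappings or inference pipelines.
        for doc in documents:
            self._docs.append({"indexed_as": self.retrieval_strategy.value, **doc})
        return len(documents)

    def search(self, query, top_k=10):
        # Placeholder: a real store would build a strategy-specific
        # Elasticsearch query. This stub returns the first top_k documents.
        return self._docs[:top_k]


class ElasticsearchDenseVectorRetriever:
    """Retriever from point 2 that only accepts a matching document store."""

    expected_strategy = RetrievalStrategy.DENSE_VECTOR  # assumed attribute name

    def __init__(self, document_store):
        # Check upon initialization that the store was set up for this retriever.
        if document_store.retrieval_strategy is not self.expected_strategy:
            raise ValueError(
                f"Expected retrieval strategy {self.expected_strategy.value!r}, "
                f"got {document_store.retrieval_strategy.value!r}"
            )
        self.document_store = document_store

    def run(self, query, top_k=10):
        # Retrieval is delegated to the store's search method.
        return {"documents": self.document_store.search(query, top_k=top_k)}
```

With this shape, a mismatch between store and retriever fails at construction time rather than at query time, and each additional retriever only needs to declare the strategy it expects.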

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

### Tasks
- [ ] The code is documented with docstrings and was merged in the `main` branch
- [ ] Docs are published at https://docs.haystack.deepset.ai/
- [ ] There is a GitHub workflow running the tests for the integration nightly and at every PR
- [ ] A label named like `integration:<your integration name>` has been added to this repo
- [ ] The [labeler.yml](https://github.com/deepset-ai/haystack-core-integrations/blob/main/.github/labeler.yml) file has been updated
- [ ] The package has been released on PyPI
- [ ] An integration tile has been added to https://github.com/deepset-ai/haystack-integrations
- [ ] The integration has been listed in the [Inventory section](https://github.com/deepset-ai/haystack-core-integrations#inventory) of this repo README
- [ ] There is an example available to demonstrate the feature
- [ ] The feature was announced through social media

maxjakob commented 2 months ago

@anakin87 @silvanocerza Would be great to get your input here.

silvanocerza commented 2 months ago

I don't see why not, to be fair; I'm not against this at all. Everything you wrote makes total sense in my opinion.

silvanocerza commented 2 months ago

Are you going to handle the implementation of this? 👀

anakin87 commented 2 months ago

thank you for your interest!

maxjakob commented 2 months ago

I agree that breaking changes should be avoided. We can attempt to integrate this into the existing document store. If that proves too hard to do without breaking changes, we can add a new class (and deprecate the old one). What do you think?

Regarding naming, here are some proposals (I'm completely open to other names):

maxjakob commented 2 months ago

I'm going to work on the LangChain integration. It will become the reference implementation for this kind of integration with the package mentioned above. It would be fantastic if somebody from the community wanted to give it a shot and integrate this into Haystack. If they are interested, that person would be invited to write a blog post for Elastic Search Labs, which would give them and their Haystack use case some exposure.

maxjakob commented 1 month ago

The LangChain reference implementation mentioned above can be found here: https://github.com/langchain-ai/langchain-elastic/blob/66cf6f110dbfb2a89a1f92fbaa6488022275e17d/libs/elasticsearch/langchain_elasticsearch/vectorstores.py#L553