deepset-ai / haystack-core-integrations

Additional packages (components, document stores and the likes) to extend the capabilities of Haystack version 2.0 and onwards
https://haystack.deepset.ai
Apache License 2.0
122 stars 120 forks source link

feat: MongoDB Atlas keyword search #1200

Open kanenorman opened 6 days ago

kanenorman commented 6 days ago

Related Issues

Proposed Changes:

Added MongoDBAtlasDocumentStore._fulltext_retrieval() and the corresponding MongoDBAtlasFullTextRetriever component, enabling users to leverage MongoDB Atlas's full-text search capabilities.

How did you test it?

I did not implement a test for full-text retrieval. Similar to test_embedding_retrieval.py, it would likely be more secure and maintainable for a member of the Haystack team to write the test for a collection created in Haystack's MongoDB Atlas cluster, specifically in a file called test_fulltext_retrieval.py.. (I would also be willing to write test if that is preferred).

Notes for the reviewer

In MongoDBAtlasDocumentStore._fulltext_retrieval(), filters are currently passed directly to the $match stage, similar to how Langchain implements this behavior. While this approach works, it is not optimal for performance, as outlined in the MongoDB Atlas documentation. A more efficient solution would involve adding a filter clause to the compound operator. However, this requires additional work since the filter syntax used in the $match stage is not directly compatible with the $filter clause. Implementing this would require writing a new helper function, _normalize_compound_filter, to convert Haystack filter syntax into the format understood by MongoDB’s compound filter. Given that this would significantly expand the size & scope of this PR, I believe it is more appropriate to address this enhancement in a separate PR.

Checklist

kanenorman commented 6 days ago

I'm encountering a peculiar configuration error in CI when running this line of code in the test:

connection: MongoClient = MongoClient(
    os.environ["MONGO_CONNECTION_STRING"], 
    driver=DriverInfo(name="MongoDBAtlasHaystackIntegration")
)

The error is as follows:

/home/runner/.local/share/hatch/env/virtual/mongodb-atlas-haystack/BNcyCByD/mongodb-atlas-haystack/lib/python3.10/site-packages/pymongo/synchronous/mongo_client.py:797: in __init__
    seeds.update(uri_parser.split_hosts(entity, port))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

hosts = '', default_port = 27017

    def split_hosts(hosts: str, default_port: Optional[int] = DEFAULT_PORT) -> list[_Address]:
        """Takes a string of the form host1[:port],host2[:port]...
        Splits it into (host, port) tuples. If [:port] isn't present, the
        default_port is used.
        ...
        """
        nodes = []
        for entity in hosts.split(","):
            if not entity:
>               raise ConfigurationError("Empty host (or extra comma in host list).")
E               pymongo.errors.ConfigurationError: Empty host (or extra comma in host list).

/home/runner/.local/share/hatch/env/virtual/mongodb-atlas-haystack/BNcyCByD/mongodb-atlas-haystack/lib/python3.10/site-packages/pymongo/uri_parser.py:392: ConfigurationError

This issue occurs only on Mac/Linux CI but not on Windows. Curiously, I cannot reproduce the failure locally on my Mac. I'll continue investigating the root cause of this inconsistency in the CI environment.

EDIT: @mpangrazzi. Realized I won't have access to {{ secrets.MONGO_CONNECTION_STRING }} in the CI Workflow from a public fork.

image
anakin87 commented 4 days ago

Hey @kanenorman, thanks for the contribution and sorry for the delay...

I will try to have a look in the next few days!

vblagoje commented 10 hours ago

@kanenorman my colleague @mpangrazzi and I will take over the review of this PR.

This PR already looks good. Before we dive into details, would you please:

Can we somehow use both text-search and vector indices? If so an example of both retrieval setups for our users would be great. If this is possible perhaps @mpangrazzi and I can work on this example.

Regarding CI - you are right. Let me see internally what we can do here short of migrating your branch to our main fork and two of us doing the final polishing, checks and the final integration.

kanenorman commented 1 hour ago

@vblagoje - Thank you! I'll address the two bullet points you mentioned above.

Also,

Can we somehow use both text-search and vector indices? If so, an example of both retrieval setups for our users would be great. If this is possible, perhaps @mpangrazzi and I can work on this example.

For this, are you requesting to see a hybrid retrieval scenario where we demonstrate: