Open kanenorman opened 6 days ago
I'm encountering a peculiar configuration error in CI when running this line of code in the test:
connection: MongoClient = MongoClient(
os.environ["MONGO_CONNECTION_STRING"],
driver=DriverInfo(name="MongoDBAtlasHaystackIntegration")
)
The error is as follows:
/home/runner/.local/share/hatch/env/virtual/mongodb-atlas-haystack/BNcyCByD/mongodb-atlas-haystack/lib/python3.10/site-packages/pymongo/synchronous/mongo_client.py:797: in __init__
seeds.update(uri_parser.split_hosts(entity, port))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
hosts = '', default_port = 27017
def split_hosts(hosts: str, default_port: Optional[int] = DEFAULT_PORT) -> list[_Address]:
"""Takes a string of the form host1[:port],host2[:port]...
Splits it into (host, port) tuples. If [:port] isn't present, the
default_port is used.
...
"""
nodes = []
for entity in hosts.split(","):
if not entity:
> raise ConfigurationError("Empty host (or extra comma in host list).")
E pymongo.errors.ConfigurationError: Empty host (or extra comma in host list).
/home/runner/.local/share/hatch/env/virtual/mongodb-atlas-haystack/BNcyCByD/mongodb-atlas-haystack/lib/python3.10/site-packages/pymongo/uri_parser.py:392: ConfigurationError
This issue occurs only on Mac/Linux CI but not on Windows. Curiously, I cannot reproduce the failure locally on my Mac. I'll continue investigating the root cause of this inconsistency in the CI environment.
EDIT:
@mpangrazzi. Realized I won't have access to {{ secrets.MONGO_CONNECTION_STRING }}
in the CI Workflow from a public fork.
Hey @kanenorman, thanks for the contribution and sorry for the delay...
I will try to have a look in the next few days!
@kanenorman my colleague @mpangrazzi and I will take over the review of this PR.
This PR already looks good. Before we dive into details, would you please:
Can we somehow use both text-search and vector indices? If so an example of both retrieval setups for our users would be great. If this is possible perhaps @mpangrazzi and I can work on this example.
Regarding CI - you are right. Let me see internally what we can do here short of migrating your branch to our main fork and two of us doing the final polishing, checks and the final integration.
@vblagoje - Thank you! I'll address the two bullet points you mentioned above.
Also,
Can we somehow use both text-search and vector indices? If so, an example of both retrieval setups for our users would be great. If this is possible, perhaps @mpangrazzi and I can work on this example.
For this, are you requesting to see a hybrid retrieval scenario where we demonstrate:
MongoDBAtlasFullTextRetriever
utilizing MongoDBAtlasDocumentStore.full_text_search_index
MongoDBAtlasEmbeddingRetriever
utilizing MongoDBAtlasDocumentStore.vector_search_index
Related Issues
Proposed Changes:
Added
MongoDBAtlasDocumentStore._fulltext_retrieval()
and the correspondingMongoDBAtlasFullTextRetriever
component, enabling users to leverage MongoDB Atlas's full-text search capabilities.How did you test it?
full_text_search_index
MongoDBAtlasFullTextRetriever
component totest_retriever.py
I did not implement a test for full-text retrieval. Similar to
test_embedding_retrieval.py
, it would likely be more secure and maintainable for a member of the Haystack team to write the test for a collection created in Haystack's MongoDB Atlas cluster, specifically in a file calledtest_fulltext_retrieval.py.
. (I would also be willing to write test if that is preferred).Notes for the reviewer
In
MongoDBAtlasDocumentStore._fulltext_retrieval()
, filters are currently passed directly to the$match
stage, similar to how Langchain implements this behavior. While this approach works, it is not optimal for performance, as outlined in the MongoDB Atlas documentation. A more efficient solution would involve adding afilter
clause to thecompound
operator. However, this requires additional work since the filter syntax used in the$match
stage is not directly compatible with the$filter
clause. Implementing this would require writing a new helper function,_normalize_compound_filter
, to convert Haystack filter syntax into the format understood by MongoDB’s compound filter. Given that this would significantly expand the size & scope of this PR, I believe it is more appropriate to address this enhancement in a separate PR.Checklist
fix:
,feat:
,build:
,chore:
,ci:
,docs:
,style:
,refactor:
,perf:
,test:
.