deepset-ai / haystack-core-integrations

Additional packages (components, document stores and the likes) to extend the capabilities of Haystack version 2.0 and onwards
https://haystack.deepset.ai
Apache License 2.0

Mongo Dense Retriever - Unrecognized pipeline stage name: '$vectorSearch' #583

Closed tillwf closed 3 months ago

tillwf commented 4 months ago

Describe the bug

Same as this bug: https://github.com/deepset-ai/haystack/issues/7031, but with haystack-ai==2.0.0.

Error message

haystack.document_stores.errors.errors.DocumentStoreError: Retrieval of documents from MongoDB Atlas failed: Unrecognized pipeline stage name: '$vectorSearch', full error: {'ok': 0.0, 'errmsg': "Unrecognized pipeline stage name: '$vectorSearch'", 'code': 40324, 'codeName': 'Location40324', '$clusterTime': {'clusterTime': Timestamp(1710497081, 11), 'signature': {'hash': b'\xd7"bC\xf4\xd0\xb8\r\xc8\xe56b/xn\xda\'\x98B\x7f', 'keyId': 7294832502911270929}}, 'operationTime': Timestamp(1710497081, 11)}
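Error code 40324 means the server itself does not recognize the `$vectorSearch` aggregation stage. One possible cause, not confirmed anywhere in this thread, is a cluster running a MongoDB version that predates the stage. The sketch below checks a server version string against minimum versions. The thresholds (6.0.11 on the 6.0 branch, 7.0.2 otherwise) are my reading of the Atlas documentation and should be verified there. The helper names `parse_version`, `supports_vector_search`, and `check_cluster` are hypothetical and not part of Haystack or PyMongo.

```python
import re

# Assumed minimum versions (verify against the current Atlas documentation).
MIN_6_0 = (6, 0, 11)
MIN_7_0 = (7, 0, 2)


def parse_version(version: str) -> tuple:
    """Extract (major, minor, patch) from a string such as '7.0.2' or '6.0.11-rc0'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def supports_vector_search(version: str) -> bool:
    """Return True if the version meets the assumed $vectorSearch minimum."""
    parsed = parse_version(version)
    if parsed[:2] == (6, 0):
        # The 6.0 branch has its own assumed threshold.
        return parsed >= MIN_6_0
    return parsed >= MIN_7_0


def check_cluster(connection_string: str) -> bool:
    """Ask a live cluster for its version (requires pymongo and network access)."""
    from pymongo import MongoClient  # imported lazily so the helpers above work without it

    client = MongoClient(connection_string)
    try:
        return supports_vector_search(client.server_info()["version"])
    finally:
        client.close()
```

Calling `check_cluster(os.environ["MONGO_CONNECTION_STRING"])` before building the pipeline would flag an outdated cluster early. A `True` result only rules out this one cause; it does not guarantee that the stage is available on a given deployment.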

To Reproduce

Here is a minimal script that reproduces the error:

import os

from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.rankers import TransformersSimilarityRanker
from haystack.utils import ComponentDevice
from haystack.utils import Secret
from haystack_integrations.components.retrievers.mongodb_atlas import MongoDBAtlasEmbeddingRetriever
from haystack_integrations.document_stores.mongodb_atlas import MongoDBAtlasDocumentStore

document_store = MongoDBAtlasDocumentStore(
    mongo_connection_string=Secret.from_env_var("MONGO_CONNECTION_STRING"),
    database_name=os.getenv("MONGO_DB"),
    collection_name="articles_embeddings",
    vector_search_index="embedding_index",
)
document_cleaner = DocumentCleaner(
    remove_empty_lines=True,
    remove_extra_whitespaces=True,
    remove_repeated_substrings=False,
)
document_splitter = DocumentSplitter(
    split_by="word",
    split_length=512,
    split_overlap=32
)
document_embedder = SentenceTransformersDocumentEmbedder(
    model="BAAI/bge-small-en-v1.5",
    device=ComponentDevice.from_str("cuda:0")
)
text_embedder = SentenceTransformersTextEmbedder(
    model="BAAI/bge-small-en-v1.5",
    device=ComponentDevice.from_str("cuda:0")
)

embedding_retriever = MongoDBAtlasEmbeddingRetriever(document_store=document_store)
ranker = TransformersSimilarityRanker(model="BAAI/bge-reranker-base")

pipeline = Pipeline()
pipeline.add_component("text_embedder", text_embedder)
pipeline.add_component("embedding_retriever", embedding_retriever)
pipeline.add_component("ranker", ranker)

pipeline.connect("text_embedder", "embedding_retriever")
pipeline.connect("embedding_retriever", "ranker")

# First search to warm the model
pipeline.run(
    {
        "text_embedder": {
            "text": "test"
        },
        "ranker": {
            "query": "test"
        }
    }
)

Here is a screenshot of the index I created: [image]

and the code I used to create it:

{
  "fields":[
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 384,
      "similarity": "cosine"
    }
  ]
}
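As a quick sanity check, the index definition can be compared with what the embedder produces. The sketch below assumes the embedding model emits 384-dimensional vectors (the value used in the definition above) and that embeddings are stored under the `embedding` path. The helper names `find_vector_field` and `index_problems` are hypothetical and only for illustration.

```python
# Index definition copied from the issue.
INDEX_DEFINITION = {
    "fields": [
        {
            "type": "vector",
            "path": "embedding",
            "numDimensions": 384,
            "similarity": "cosine",
        }
    ]
}


def find_vector_field(definition: dict, path: str = "embedding"):
    """Return the vector field entry for `path`, or None if it is not defined."""
    for field in definition.get("fields", []):
        if field.get("type") == "vector" and field.get("path") == path:
            return field
    return None


def index_problems(definition: dict, embedding_dim: int, path: str = "embedding") -> list:
    """List mismatches between the index definition and the embedder output."""
    field = find_vector_field(definition, path)
    if field is None:
        return [f"no vector field defined for path {path!r}"]
    problems = []
    if field.get("numDimensions") != embedding_dim:
        problems.append(
            f"numDimensions is {field.get('numDimensions')}, "
            f"but the embedder produces {embedding_dim}"
        )
    return problems


print(index_problems(INDEX_DEFINITION, embedding_dim=384))
```

An empty list means the definition is consistent with the assumed embedding size. It says nothing about whether the cluster accepts the `$vectorSearch` stage, which is what the reported error is about.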


anakin87 commented 3 months ago

I cannot reproduce the bug.

What I did:

import os

from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack_integrations.components.retrievers.mongodb_atlas import MongoDBAtlasEmbeddingRetriever
from haystack_integrations.document_stores.mongodb_atlas import MongoDBAtlasDocumentStore

os.environ["MONGO_CONNECTION_STRING"] = "..."

document_store = MongoDBAtlasDocumentStore(
    database_name="test",
    collection_name="test",
    vector_search_index="vector_index",
)

# indexing phase
docs = [
    Document(content="This is a test", meta={"name": "test"}),
    Document(content="this is a document about dogs", meta={"name": "dog_doc"}),
    Document(content="this is a document about cats", meta={"name": "cat_doc"}),
]

embedder = SentenceTransformersDocumentEmbedder(model="BAAI/bge-small-en-v1.5")
embedder.warm_up()

docs_with_embeddings = embedder.run(docs)["documents"]

print(document_store.write_documents(docs_with_embeddings))
# 3

# retrieval phase
retriever = MongoDBAtlasEmbeddingRetriever(document_store=document_store, top_k=3)
results = retriever.run(query_embedding=[0.1] * 384)
print(results)

{'documents': [Document(id=0fc6abdbe5192ea10917b506084077451b47ccf097d5899f963a193b048a33a7, content: 'this is a document about cats', meta: {'name': 'cat_doc'}, score: 0.5037540197372437, embedding: vector of size 384), Document(id=ffd30337557ed1870cb5833d832c1a3c41f4889b3545e9c0b5e69108592661fd, content: 'This is a test', meta: {'name': 'test'}, score: 0.503305971622467, embedding: vector of size 384), Document(id=274731104067ab6f2e07380d4b1cd20112b26cd99fc6f36da8f9f4a7d6f06e00, content: 'this is a document about dogs', meta: {'name': 'dog_doc'}, score: 0.5031192898750305, embedding: vector of size 384)]}



I also tried a more complex example, with a Retrieval Pipeline with a Text Embedder and a Ranker, but I cannot reproduce the error.

@tillwf I'm closing the issue. Feel free to reopen it and add more details if the problem persists.