deepset-ai / haystack-integrations

🚀 A list of Haystack Integrations, maintained by the community or deepset.

code examples in `elasticsearch-document-store.md` throw errors #98

Closed annthurium closed 8 months ago

annthurium commented 8 months ago

I tried to run the code in `elasticsearch-document-store.md` and ran into some errors. I attempted to fix them but got slightly stuck. If someone could point me in the right direction, I'm happy to open a PR.

The topmost block of code:

document_store = ElasticsearchDocumentStore(hosts = "http://localhost:9200")
converter = TextFileToDocument()
splitter = DocumentSplitter()
doc_embedder = SentenceTransformersDocumentEmbedder(model_name_or_path="sentence-transformers/multi-qa-mpnet-base-dot-v1")
writer = DocumentWriter(document_store)

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("converter", converter)
indexing_pipeline.add_component("splitter", splitter)
indexing_pipeline.add_component("doc_embedder", doc_embedder)
indexing_pipeline.add_component("writer", writer)

indexing_pipeline.connect("converter", "splitter")
indexing_pipeline.connect("splitter", "doc_embedder")
indexing_pipeline.connect("doc_embedder", "writer")

indexing_pipeline.run({"converter": {"sources": ["filename.txt"]}})

Produces an error:

Failed to write documents to Elasticsearch. Errors:
[{'create': {'_index': 'default', '_id': '6383dc3ed51fc90c2e45704853a7ad9b14168f4f262ac7a0b65e02c465d0bb1c', 'status': 400, 'error': {'type': 'document_parsing_exception', 'reason': "[1:15833] failed to parse: The [dense_vector] field [embedding] in doc [document with id '6383dc3ed51fc90c2e45704853a7ad9b14168f4f262ac7a0b65e02c465d0bb1c'] has a different number of dimensions [768] than defined in the mapping [1024]", 'caused_by': {'type': 'illegal_argument_exception', 'reason': "The [dense_vector] field [embedding] in doc [document with id '6383dc3ed51fc90c2e45704853a7ad9b14168f4f262ac7a0b65e02c465d0bb1c'] has a different number of dimensions [768] than defined in the mapping [1024]"}}}}]'
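
The reason string in that error carries both numbers: the length of the vector being written and the `dims` the index mapping expects. A minimal sketch for pulling them out, assuming the message wording stays as shown above; `extract_dims` is a hypothetical helper and is not part of Haystack or the Elasticsearch client.

```python
import re

# Matches the wording of the Elasticsearch error quoted above.
# Assumption: other Elasticsearch versions may phrase this differently.
_DIMS_RE = re.compile(
    r"number of dimensions \[(\d+)\] than defined in the mapping \[(\d+)\]"
)


def extract_dims(reason: str):
    """Return (document_dims, mapping_dims), or None if the text does not match."""
    match = _DIMS_RE.search(reason)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# The 'reason' value copied from the error payload above.
reason = (
    "The [dense_vector] field [embedding] in doc [document with id "
    "'6383dc3ed51fc90c2e45704853a7ad9b14168f4f262ac7a0b65e02c465d0bb1c'] "
    "has a different number of dimensions [768] than defined in the mapping [1024]"
)

print(extract_dims(reason))  # (768, 1024)
```

Read this way, the documents carry 768-dimensional embeddings while the existing `default` index was created with a 1024-dimensional `embedding` field.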

Other than the `SentenceTransformersTextEmbedder`, which of these components requires us to specify a `model_name_or_path`? It wasn't easy to figure out from looking at the documentation or reading the Haystack source code. 🤔
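
One way to narrow this down is to compare the embedding length against the `dims` already stored in the index mapping before writing anything. The sketch below assumes a mapping response shaped like what Elasticsearch returns for `GET /default/_mapping`; `mapping_dims` and `check_embedding` are hypothetical helper names, and `sample_mapping` is an illustrative value, not output captured from a real cluster.

```python
def mapping_dims(mapping_response: dict, index: str = "default", field: str = "embedding"):
    """Read `dims` for a dense_vector field from a get-mapping style response.

    Returns None when the field or its `dims` is not present in the mapping.
    """
    properties = mapping_response[index]["mappings"].get("properties", {})
    return properties.get(field, {}).get("dims")


def check_embedding(embedding, mapping_response, index="default", field="embedding"):
    """Raise ValueError if the embedding length differs from the mapped dims."""
    expected = mapping_dims(mapping_response, index, field)
    if expected is not None and len(embedding) != expected:
        raise ValueError(
            f"embedding has {len(embedding)} dimensions, "
            f"but index '{index}' maps '{field}' with dims={expected}"
        )


# Illustrative mapping: an index that already holds 1024-dimensional vectors.
sample_mapping = {
    "default": {
        "mappings": {
            "properties": {"embedding": {"type": "dense_vector", "dims": 1024}}
        }
    }
}

try:
    # A 768-long vector, matching the size reported in the error above.
    check_embedding([0.0] * 768, sample_mapping)
except ValueError as err:
    print(err)
```

If the check fails, the mismatch comes from the index rather than from a missing model parameter: the index was presumably created earlier with vectors from a different model, so writing to a fresh index or recreating the existing one would be the places to look.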

With the second block of code, I'm running into the same error about mismatched embedding dimensions. There were also a few errors with parameter names and the like that were easy to clean up:

from elasticsearch_haystack.document_store import ElasticsearchDocumentStore
from haystack.pipeline import Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder 
from elasticsearch_haystack.embedding_retriever import ElasticsearchEmbeddingRetriever

model_name_or_path = "sentence-transformers/multi-qa-mpnet-base-dot-v1"

document_store = ElasticsearchDocumentStore(hosts = "http://localhost:9200")
retriever = ElasticsearchEmbeddingRetriever(document_store=document_store)
text_embedder = SentenceTransformersTextEmbedder(model_name_or_path=model_name_or_path)

query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", text_embedder)
query_pipeline.add_component("retriever", retriever)
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

query_pipeline.run({"text_embedder": {"text": "historical places in Istanbul"}})