deepset-ai / haystack

LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Pipeline run order wrong #7985

Open ju-gu opened 5 days ago

ju-gu commented 5 days ago

Describe the bug

With a more complex pipeline, the run order is wrong: the pipeline fails to identify the first node and sets the "documents" and query inputs to empty strings. Nodes are then executed multiple times, overwriting these wrong intermediate outputs at run time.

The point of failure is the _component_has_enough_inputs_to_run method in pipeline.py: the expected inputs for prompt_builder1 are question, template and template_variables, but only question is provided, so the function returns False. Later, a different component is executed with its "default" values, which are all None or empty strings. However, the template is already passed to the prompt builder at instantiation, and the template_variables only include the question passed to the run method, so there should be no mismatch between expected and provided inputs.
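To make the suspected mismatch concrete, here is a minimal sketch of the two ways such a readiness check could behave. It is not Haystack's implementation of _component_has_enough_inputs_to_run: InputSocket, strict_check and default_aware_check are hypothetical names, and which inputs carry defaults is an assumption made only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputSocket:
    """Hypothetical description of one declared component input."""
    name: str
    has_default: bool = False


def strict_check(sockets, provided):
    """Ready only if every declared input was provided."""
    return all(s.name in provided for s in sockets)


def default_aware_check(sockets, provided):
    """Ready if every input WITHOUT a default was provided."""
    return all(s.name in provided for s in sockets if not s.has_default)


# Assumption for this toy: template and template_variables have defaults,
# while question is treated as mandatory.
prompt_builder1_inputs = [
    InputSocket("question"),
    InputSocket("template", has_default=True),
    InputSocket("template_variables", has_default=True),
]
provided = {"question"}  # what pipeline.run() passes in the report

strict_ready = strict_check(prompt_builder1_inputs, provided)
default_aware_ready = default_aware_check(prompt_builder1_inputs, provided)
print(strict_ready, default_aware_ready)
```

Under these assumptions, a check that compares the provided names against all declared names rejects prompt_builder1, while a check that skips inputs with defaults accepts it.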

Passing template and template_variables in the run method resolves this issue (this shouldn't be needed, though).

Output of the sample pipeline (nodes are executed multiple times, starting with the second LLM):

[Image: output of the sample pipeline]

To Reproduce

Run this pipeline and check the execution order

from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.components.rankers import TransformersSimilarityRanker
from haystack.components.generators import OpenAIGenerator
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack import Pipeline
from dotenv import load_dotenv
import os
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder
import logging
from haystack.utils import Secret

logging.basicConfig()
logging.getLogger("haystack.core.pipeline.pipeline").setLevel(logging.DEBUG)

doc_store = InMemoryDocumentStore()
path = "../data/test_folder/"
pathlist = [path+x for x in os.listdir(path)]
converter = TextFileToDocument()

print(f"Documents: {doc_store.count_documents()}")

load_dotenv("ENV_PATH")
openai_api_key = Secret.from_env_var("OPENAI_API_KEY")

prompt_template1 = """
You are a spellchecking system. Check the given query and fill in the corrected query.

Question: {{question}}
Corrected question: 
"""
prompt_template2 = """
According to these documents:

{% for doc in documents %}
  {{ doc.content }}
{% endfor %}

Answer the given question: {{question}}
Answer:
"""

prompt_template3 = """
{% for ans in replies %}
  {{ ans }}
{% endfor %}
"""

prompt_builder1 = PromptBuilder(template=prompt_template1)
prompt_builder2 = PromptBuilder(template=prompt_template2)
prompt_builder3 = PromptBuilder(template=prompt_template3)

llm1 = OpenAIGenerator(api_key=openai_api_key)
llm2 = OpenAIGenerator(api_key=openai_api_key)

ranker = TransformersSimilarityRanker(top_k=5)
retriever = InMemoryEmbeddingRetriever(document_store=doc_store)
embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
splitter = DocumentSplitter(split_by="word", split_length=200, split_overlap=10)
writer = DocumentWriter(document_store=doc_store)

indexing_p = Pipeline()
indexing_p.add_component(name="converter", instance=converter)
indexing_p.add_component(name="splitter", instance=splitter)
indexing_p.add_component(name="DocEmbedder", instance=doc_embedder)
indexing_p.add_component(name="writer", instance=writer)

indexing_p.connect("converter.documents", "splitter")
indexing_p.connect("splitter.documents", "DocEmbedder.documents")
indexing_p.connect("DocEmbedder.documents", "writer.documents")

indexing_p.run({"converter": {"sources": pathlist}})

print(f"Documents: {doc_store.count_documents()}")

pipeline = Pipeline()
pipeline.add_component(name="TextEmbedder", instance=embedder)
pipeline.add_component(name="retriever", instance=retriever)
pipeline.add_component(name="ranker", instance=ranker)
pipeline.add_component(name="prompt_builder2", instance=prompt_builder2)
pipeline.add_component(name="prompt_builder1", instance=prompt_builder1)
pipeline.add_component(name="prompt_builder3", instance=prompt_builder3)
pipeline.add_component(name="llm", instance=llm1)
pipeline.add_component(name="spellchecker", instance=llm2)

pipeline.connect("prompt_builder1", "spellchecker")
pipeline.connect("spellchecker.replies", "prompt_builder3")
pipeline.connect("prompt_builder3", "TextEmbedder.text")
pipeline.connect("prompt_builder3", "ranker.query")
pipeline.connect("TextEmbedder", "retriever.query_embedding")
pipeline.connect("retriever", "ranker")
pipeline.connect("ranker", "prompt_builder2.documents")
pipeline.connect("prompt_builder3", "prompt_builder2.question")
pipeline.connect("prompt_builder2", "llm")

question = "Wha i Acromegaly?"
result = pipeline.run({
    "prompt_builder1": {"question": question}})
# print(result)

test_data.zip


silvanocerza commented 5 days ago

I briefly investigated by bisecting. The last commit on which this Pipeline works is https://github.com/deepset-ai/haystack/commit/badb05b3abb09fa190049b31b975365d69dd0112; the bug seems to have been introduced by the commit right after it, https://github.com/deepset-ai/haystack/commit/83d3970405085aae5b22dc0f715398077f1f71fc.

Seems like the changes to PromptBuilder in #7655 surfaced this bug.

I'm still not sure what the actual cause is and will keep investigating.

silvanocerza commented 2 days ago

A temporary workaround is to add required_variables to the PromptBuilders, as done below; this makes the Pipeline run as expected.

prompt_builder2 = PromptBuilder(template=prompt_template2, required_variables=["documents", "question"])
prompt_builder3 = PromptBuilder(template=prompt_template3, required_variables=["replies"])

Another solution could be changing the order in which the PromptBuilders are added to the Pipeline:

pipeline.add_component(name="prompt_builder1", instance=prompt_builder1)
pipeline.add_component(name="prompt_builder3", instance=prompt_builder3)
pipeline.add_component(name="prompt_builder2", instance=prompt_builder2)

This problem is caused by a combination of things: the way we decide which Component to run next, the fact that the order in which Components are added influences the run order, and how we treat Components that have only inputs with defaults.
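The interplay of these three factors can be illustrated with a toy scheduler. This is a simplified sketch, not Haystack's actual run logic: run_order is a hypothetical helper, the graph keeps only the three PromptBuilders (the generators, embedder, retriever and ranker are omitted), and "required" inputs stand in for what required_variables would declare.

```python
def run_order(components, edges, user_inputs):
    """Return the order in which a naive scheduler runs the components.

    components:  dict of name -> set of mandatory input names; the dict's
                 insertion order plays the role of add_component order
    edges:       list of (sender, receiver, input_name) connections
    user_inputs: dict of name -> set of input names supplied by the caller
    """
    received = {name: set(user_inputs.get(name, ())) for name in components}
    pending = list(components)
    order = []
    while pending:
        # Pick the first pending component whose mandatory inputs are all
        # present; a component with only defaulted inputs always qualifies.
        for name in pending:
            if components[name] <= received[name]:
                break
        else:
            raise RuntimeError("no component can run")
        pending.remove(name)
        order.append(name)
        # Deliver the outputs of the component that just ran.
        for sender, receiver, input_name in edges:
            if sender == name:
                received[receiver].add(input_name)
    return order


edges = [
    ("prompt_builder1", "prompt_builder3", "replies"),
    ("prompt_builder3", "prompt_builder2", "question"),
]
user_inputs = {"prompt_builder1": {"question"}}

# Insertion order from the report, every input has a default.
all_defaults = {"prompt_builder2": set(), "prompt_builder1": set(), "prompt_builder3": set()}
# Same insertion order, but with mandatory inputs declared.
with_required = {"prompt_builder2": {"question"}, "prompt_builder1": set(), "prompt_builder3": {"replies"}}
# Every input has a default, but the components are added in data-flow order.
reordered = {"prompt_builder1": set(), "prompt_builder3": set(), "prompt_builder2": set()}

buggy = run_order(all_defaults, edges, user_inputs)
fixed_by_required = run_order(with_required, edges, user_inputs)
fixed_by_reordering = run_order(reordered, edges, user_inputs)
print(buggy, fixed_by_required, fixed_by_reordering)
```

In this toy, a component whose inputs all have defaults is always considered runnable, so the component added first runs first regardless of the connections. Declaring mandatory inputs or adding the components in data-flow order both restore the expected sequence, which mirrors the two workarounds.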

Ideally, the fix would change how we decide which Component to run next so that it is independent of the other two factors, without breaking existing use cases.

Not sure how easy that will be. 😕