deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Allow Simplified Input Slot Resolution for Haystack 2.x Pipelines #6101

Open vblagoje opened 11 months ago

vblagoje commented 11 months ago

Description:

Currently, when running a Haystack 2.x pipeline, we have to explicitly name the component that owns each input slot. For example:

query = "What's the meaning of it all?"
query_dict = {"query": query}
result = pipe.run(data={"search": query_dict, "prompt_builder": {"query": query, "messages": messages}, "similarity_ranker": query_dict})

This can get cumbersome, especially when the number of components increases or the pipeline's complexity grows.

Describe the solution you'd like:

Could we provide only the key/value pairs, without naming the components? The pipeline should then be smart enough to resolve all input slots that can be filled with these key/value pairs. Ideally, the following should be possible:

result = pipe.run(data={"query": "What's the meaning of it all?", "messages": messages})

In this scenario, the pipeline should automatically determine which components can be fed with the "query" and "messages" inputs, respectively.
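
For illustration, here is a minimal sketch of the kind of resolution the pipeline could do internally. It assumes Pipeline.inputs() reports the unconnected input sockets per component (as it does in Haystack 2.x); the helper resolve_flat_inputs is hypothetical, not an existing API, and pipe / messages are the objects from the example above:

```python
# Hypothetical helper (not part of Haystack): expand a flat dict of values into the
# nested {component_name: {socket_name: value}} structure that Pipeline.run() expects.
# It relies on Pipeline.inputs() returning the unconnected input sockets per component.
def resolve_flat_inputs(pipe, flat_data):
    resolved = {}
    for component_name, sockets in pipe.inputs().items():
        for socket_name in sockets:
            if socket_name in flat_data:
                resolved.setdefault(component_name, {})[socket_name] = flat_data[socket_name]
    return resolved


# The caller only provides key/value pairs; the helper decides which components get them.
flat = {"query": "What's the meaning of it all?", "messages": messages}
result = pipe.run(data=resolve_flat_inputs(pipe, flat))
```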

silvanocerza commented 11 months ago

+1 on this, I was thinking about it too. It's quite frustrating having to specify the component name each time.

Timoeller commented 9 months ago

Hey, while the functionality to pass flat dicts is there, I would expect a RAG pipeline (with embedder, prompt_builder, and answer_builder) to also work with: .run(data={"query": "What are superlinear returns and why are they important?"})

But it doesn't 😞. Because prompt_builder expects a question param and the embedder a text param, you have to use: rag_pipeline.run(data={"query": "What are?", "text": "What are?", "question": "What are?"})

This still needs fixing; otherwise input slot resolution isn't properly solved. Reopening for now.
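
To make the mismatch concrete, here is a rough sketch of the RAG pipeline described above. The import paths and component choices follow current haystack-ai releases and may differ from the 2.x preview this comment refers to; indexing and API-key setup are omitted:

```python
from haystack import Pipeline
from haystack.components.builders import AnswerBuilder, PromptBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

template = """Answer the question using the documents below.
{% for doc in documents %}{{ doc.content }}
{% endfor %}
Question: {{ question }}"""

rag_pipeline = Pipeline()
rag_pipeline.add_component("embedder", SentenceTransformersTextEmbedder())
rag_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(InMemoryDocumentStore()))
rag_pipeline.add_component("prompt_builder", PromptBuilder(template=template))
rag_pipeline.add_component("llm", OpenAIGenerator())  # assumes OPENAI_API_KEY is set
rag_pipeline.add_component("answer_builder", AnswerBuilder())
rag_pipeline.connect("embedder.embedding", "retriever.query_embedding")
rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder.prompt", "llm.prompt")
rag_pipeline.connect("llm.replies", "answer_builder.replies")

# Three unconnected input sockets remain, each with a different name
# (embedder.text, prompt_builder.question, answer_builder.query), so the same
# string has to be passed three times even with flat-dict resolution:
question = "What are superlinear returns and why are they important?"
result = rag_pipeline.run(data={"text": question, "question": question, "query": question})
```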

vblagoje commented 9 months ago

I don't see how we can solve this unless we try to be more consistent in our naming, where possible, @Timoeller

masci commented 9 months ago

Speaking of the RAG pipeline, the only name we can adjust is question within the prompt, which should be query instead. The other names respect the semantics of each component: for example, AnswerBuilder outputs a GeneratedAnswer, which has a field called query, so it's consistent for AnswerBuilder to expect a query input. Similarly, it makes sense for TextEmbedder to take a text input (and for DocumentEmbedder to take documents).
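
As a minimal sketch of that rename: PromptBuilder derives its input sockets from the variables in its Jinja template, so switching the placeholder from question to query is enough to line the prompt builder up with AnswerBuilder's query input:

```python
from haystack.components.builders import PromptBuilder

# Today: the prompt builder's free input is named "question"
prompt_builder = PromptBuilder(template="Answer this question: {{ question }}")

# Proposed: rename the placeholder so the free input becomes "query",
# matching the name AnswerBuilder already uses
prompt_builder = PromptBuilder(template="Answer this question: {{ query }}")
```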

The current design is also safer for the user: at the moment, if you don't pass all the required inputs to run, you get a meaningful error explaining how to fix it. Now imagine we made every component take a catch-all haystack-input: it would be very easy to call pipeline.run(data={"haystack-input": "What are superlinear returns?"}), have the pipeline happily forward the string to every possible component, and only discover much later that this wasn't the intended behaviour.

To recap: let's consolidate synonyms across the codebase (e.g. replace any question with query, and files with paths) ~and close this issue.~ After an offline sync we agreed this needs to be fixed somehow; we'll keep this issue open to discuss ideas.