The current implementation of Haystack Pipelines does not support parallel execution of LLMs.
Much like how the DocumentWriter uses multiple streams to write to the DocumentStore, running multiple LLMs in parallel should be an option when enough GPUs are available. For workflows that parse a wide variety of documents or modalities, the ability to use multiple LLMs to maximize GPU utilization would be critical.
Solution
Customize prompts with PromptTemplate for each LLM branch executed in parallel.
A document router able to redirect input to different LLMs based on factors such as modality (text/images/audio).
Use Python's multiprocessing module to queue each additional LLM so they execute in parallel (see the sketch after this list).
Expected inputs: a prompt and model kwargs for each type of LLM used, e.g. the GPT variant for OpenAI or pipeline kwargs for Hugging Face models.
Expected output types: llm_{0 to number of LLMs}.
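A minimal sketch of the fan-out described above, assuming each branch is described by a plain spec dict (model name, prompt, optional generation kwargs). The `run_branch` helper and the `llm_{i}` output naming are hypothetical, not an existing Haystack API; only `OpenAIGenerator` and its `run(prompt=...)` call are real Haystack 2.x components.

```python
# Minimal sketch: fan one request out to several LLM branches in parallel.
# The branch "spec" dicts, run_branch helper, and llm_{i} output naming are
# hypothetical; OpenAIGenerator and run(prompt=...) are real Haystack 2.x API.
from concurrent.futures import ProcessPoolExecutor


def run_branch(spec: dict) -> str:
    # Build the generator inside the worker so no unpicklable state
    # (HTTP clients, CUDA contexts) has to cross the process boundary.
    from haystack.components.generators import OpenAIGenerator

    generator = OpenAIGenerator(  # reads OPENAI_API_KEY from the environment
        model=spec["model"],
        generation_kwargs=spec.get("generation_kwargs", {}),
    )
    return generator.run(prompt=spec["prompt"])["replies"][0]


if __name__ == "__main__":
    branches = [
        {"model": "gpt-4o-mini", "prompt": "Summarize chunk A ..."},
        {
            "model": "gpt-4o",
            "prompt": "Summarize chunk B ...",
            "generation_kwargs": {"temperature": 0.2},
        },
    ]
    with ProcessPoolExecutor(max_workers=len(branches)) as pool:
        replies = list(pool.map(run_branch, branches))
    # One output per branch: llm_0 .. llm_{n-1}, matching the expected outputs above.
    print({f"llm_{i}": reply for i, reply in enumerate(replies)})
```

Constructing the generator inside the worker avoids pickling client objects across the process boundary, which is one of the common failure points when parallelizing components this way.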
Possible Use Cases
Multi-Modality using Routers.
Massive data parsing with multiple LLMs running simultaneously, processing far more than individual chunks at a time.
Generating variants from the same or similar prompts.
Choice of LLM based on the input language, e.g. fr, en, zh, etc.
Parsing multilingual document chunks using language-specific LLMs, e.g. sending supported languages to a local LLM while routing unsupported languages to GPT-4o (see the routing sketch below).
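The language-routing use case above can be approximated today with Haystack's existing ConditionalRouter; the supported-language list and the upstream "language" input are assumptions for illustration.

```python
# Sketch: route a query to a local LLM for supported languages and to GPT-4o
# otherwise. ConditionalRouter is real Haystack 2.x API; the supported-language
# list and the upstream "language" input are assumptions for illustration.
from haystack.components.routers import ConditionalRouter

routes = [
    {
        "condition": "{{ language in ['en', 'fr'] }}",
        "output": "{{ query }}",
        "output_name": "local_llm_query",   # connect to a local generator
        "output_type": str,
    },
    {
        "condition": "{{ language not in ['en', 'fr'] }}",
        "output": "{{ query }}",
        "output_name": "gpt4o_query",       # connect to an OpenAI generator
        "output_type": str,
    },
]
router = ConditionalRouter(routes=routes)

# "language" would normally come from an upstream language-detection component.
result = router.run(query="Bonjour, ça va ?", language="fr")
print(result)  # {'local_llm_query': 'Bonjour, ça va ?'}
```

Because each route emits under its own output_name, the downstream connections to a local generator or an OpenAI generator stay explicit in the pipeline graph.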
Proof of Concept
Refer to the attached ProofOfConcept for a sample implementation attempted through custom components with num_gpu=4, which failed because the GPU could not handle multiple processes without explicit isolation via multiprocessing.
Because of Python's GIL, multithreading cannot be relied on for true parallelism; the recommended approach remains multiprocessing rather than multithreading.
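A hedged sketch of the explicit isolation the proof of concept was missing, assuming one local model per GPU; the model choice and the per-process CUDA_VISIBLE_DEVICES pinning are illustrative, not a prescribed fix.

```python
# Sketch: explicit per-process GPU isolation, assuming one local model per GPU.
# CUDA contexts cannot be shared across forked workers, so each worker pins
# itself to a single device before any CUDA library is imported.
import multiprocessing as mp
import os


def worker(gpu_id: int, prompt: str) -> None:
    # Restrict this process to one GPU before the model (and torch) loads.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    from haystack.components.generators import HuggingFaceLocalGenerator

    generator = HuggingFaceLocalGenerator(model="google/flan-t5-base")
    generator.warm_up()
    print(gpu_id, generator.run(prompt=prompt)["replies"][0])


if __name__ == "__main__":
    # "spawn" gives each worker a fresh interpreter and CUDA context; workers
    # forked from a parent that already touched CUDA are a common cause of the
    # failure mode seen in the proof of concept.
    mp.set_start_method("spawn")
    processes = [mp.Process(target=worker, args=(i, f"Parse chunk {i}")) for i in range(4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
```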