The current implementation of Haystack Pipelines does not support parallel execution of LLMs.
Much like how the DocumentWriter uses multiple streams to write to the DocumentStore, running multiple LLMs in parallel should be an option when enough GPUs are available. For workflows that parse a wide variety of documents or modalities, the ability to use multiple LLMs to maximize GPU utilization would be critical.
Solution
Customize prompts with PromptTemplate for each LLM branch executed in parallel.
A document router able to redirect input to different LLMs based on factors such as modality (text/images/audio).
Use Python's multiprocessing module to queue each additional LLM so they execute in parallel (see the sketch after this list).
Expected inputs: a prompt and model kwargs for each type of LLM used, e.g. the GPT variant for OpenAI or pipeline kwargs for Hugging Face models.
Expected output types: llm_{0 to number of LLMs}.
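A minimal sketch of the fan-out described above, assuming each branch is described by a plain spec dict (model name, prompt, optional generation kwargs). The `run_branch` helper and the `llm_{i}` output naming are hypothetical, not an existing Haystack API; only `OpenAIGenerator` and its `run(prompt=...)` call are real Haystack 2.x components.

```python
# Minimal sketch: fan one request out to several LLM branches in parallel.
# The branch "spec" dicts, run_branch helper, and llm_{i} output naming are
# hypothetical; OpenAIGenerator and run(prompt=...) are real Haystack 2.x API.
from concurrent.futures import ProcessPoolExecutor


def run_branch(spec: dict) -> str:
    # Build the generator inside the worker so no unpicklable state
    # (HTTP clients, CUDA contexts) has to cross the process boundary.
    from haystack.components.generators import OpenAIGenerator

    generator = OpenAIGenerator(  # reads OPENAI_API_KEY from the environment
        model=spec["model"],
        generation_kwargs=spec.get("generation_kwargs", {}),
    )
    return generator.run(prompt=spec["prompt"])["replies"][0]


if __name__ == "__main__":
    branches = [
        {"model": "gpt-4o-mini", "prompt": "Summarize chunk A ..."},
        {
            "model": "gpt-4o",
            "prompt": "Summarize chunk B ...",
            "generation_kwargs": {"temperature": 0.2},
        },
    ]
    with ProcessPoolExecutor(max_workers=len(branches)) as pool:
        replies = list(pool.map(run_branch, branches))
    # One output per branch: llm_0 .. llm_{n-1}, matching the expected outputs above.
    print({f"llm_{i}": reply for i, reply in enumerate(replies)})
```

Constructing the generator inside the worker avoids pickling client objects across the process boundary, which is one of the common failure points when parallelizing components this way.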
Possible Use Cases
Multi-Modality using Routers.
Massive data parsing with multiple LLMs running simultaneously, processing far more than individual chunks at a time.
Generating variants from the same or similar prompts.
Choice of LLM based on the input language, e.g. fr, en, zh, etc.
Parsing multilingual document chunks using language-specific LLMs, e.g. sending supported languages to a local LLM while routing unsupported languages to GPT-4o (see the routing sketch below).
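The language-routing use case above can be approximated today with Haystack's existing ConditionalRouter; the supported-language list and the upstream "language" input are assumptions for illustration.

```python
# Sketch: route a query to a local LLM for supported languages and to GPT-4o
# otherwise. ConditionalRouter is real Haystack 2.x API; the supported-language
# list and the upstream "language" input are assumptions for illustration.
from haystack.components.routers import ConditionalRouter

routes = [
    {
        "condition": "{{ language in ['en', 'fr'] }}",
        "output": "{{ query }}",
        "output_name": "local_llm_query",   # connect to a local generator
        "output_type": str,
    },
    {
        "condition": "{{ language not in ['en', 'fr'] }}",
        "output": "{{ query }}",
        "output_name": "gpt4o_query",       # connect to an OpenAI generator
        "output_type": str,
    },
]
router = ConditionalRouter(routes=routes)

# "language" would normally come from an upstream language-detection component.
result = router.run(query="Bonjour, ça va ?", language="fr")
print(result)  # {'local_llm_query': 'Bonjour, ça va ?'}
```

Because each route emits under its own output_name, the downstream connections to a local generator or an OpenAI generator stay explicit in the pipeline graph.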
Proof of Concept
Refer to the attached ProofOfConcept for a sample implementation attempted through custom components with num_gpu=4, which failed because the GPU could not handle multiple processes without explicit isolation via multiprocessing.
Because of Python's GIL, multithreading cannot be relied on for true parallelism; the recommended approach remains multiprocessing rather than multithreading.
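A hedged sketch of the explicit isolation the proof of concept was missing, assuming one local model per GPU; the model choice and the per-process CUDA_VISIBLE_DEVICES pinning are illustrative, not a prescribed fix.

```python
# Sketch: explicit per-process GPU isolation, assuming one local model per GPU.
# CUDA contexts cannot be shared across forked workers, so each worker pins
# itself to a single device before any CUDA library is imported.
import multiprocessing as mp
import os


def worker(gpu_id: int, prompt: str) -> None:
    # Restrict this process to one GPU before the model (and torch) loads.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    from haystack.components.generators import HuggingFaceLocalGenerator

    generator = HuggingFaceLocalGenerator(model="google/flan-t5-base")
    generator.warm_up()
    print(gpu_id, generator.run(prompt=prompt)["replies"][0])


if __name__ == "__main__":
    # "spawn" gives each worker a fresh interpreter and CUDA context; workers
    # forked from a parent that already touched CUDA are a common cause of the
    # failure mode seen in the proof of concept.
    mp.set_start_method("spawn")
    processes = [mp.Process(target=worker, args=(i, f"Parse chunk {i}")) for i in range(4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
```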