The pattern used in various places is:

```python
with llm_params(llm, **params):
    result = await llm_call(llm, prompt)
```
However, when multiple requests run in parallel and one of them sets, for example, `max_tokens`, the change leaks into the other in-flight requests as well. We need to rethink/fix the underlying implementation.
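To make the failure mode concrete, here is a minimal sketch of the leak. The `LLM` class and delays are toy stand-ins (not the real client); the context manager mutates the shared instance and restores it on exit, which is the pattern suspected above:

```python
import asyncio
from contextlib import contextmanager

class LLM:
    # toy stand-in for the shared client; hypothetical field
    def __init__(self):
        self.max_tokens = 1024

@contextmanager
def llm_params(llm, **params):
    # suspected buggy approach: mutate the shared instance, restore on exit
    saved = {k: getattr(llm, k) for k in params}
    for k, v in params.items():
        setattr(llm, k, v)
    try:
        yield llm
    finally:
        for k, v in saved.items():
            setattr(llm, k, v)

async def call_with(llm, max_tokens, delay):
    with llm_params(llm, max_tokens=max_tokens):
        await asyncio.sleep(delay)   # simulate the network round-trip
        return llm.max_tokens        # value the "request" actually observes

async def main():
    llm = LLM()
    # two parallel requests with different overrides
    return await asyncio.gather(
        call_with(llm, max_tokens=10, delay=0.01),
        call_with(llm, max_tokens=500, delay=0.02),
    )

results = asyncio.run(main())
print(results)  # [500, 1024] — neither request sees its own setting
```

The first request observes the second request's `max_tokens`, and the second observes the default restored by the first's `finally` block: the save/restore logic itself races once calls interleave.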
- Instance isolation: are all parallel requests sharing a single `llm` instance?
- Use thread-local or task-local context for parameter overrides.
- Refactor `llm_params` to avoid mutating shared state. For example, instead of modifying `llm` directly, `llm_params` could return a modified copy of `llm`, or apply the parameters only within the scope of each call.
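One way to combine the task-local and no-shared-mutation ideas is `contextvars`: each asyncio task gets its own copy of the context, so overrides set inside one request never leak into another. This is a sketch under assumptions, not the project's actual API; `effective_params`, the `LLM` class, and the field names are hypothetical:

```python
import asyncio
from contextlib import contextmanager
from contextvars import ContextVar

# task-local parameter overrides; each asyncio task sees its own value
_overrides: ContextVar[dict] = ContextVar("llm_param_overrides", default={})

@contextmanager
def llm_params(llm, **params):
    # layer the new params on top of whatever this task already set;
    # the shared llm instance is never mutated
    token = _overrides.set({**_overrides.get(), **params})
    try:
        yield llm
    finally:
        _overrides.reset(token)

def effective_params(llm):
    # defaults from the shared instance, overlaid with task-local overrides
    base = {"max_tokens": llm.max_tokens}
    return {**base, **_overrides.get()}

class LLM:
    # toy stand-in for the shared client; hypothetical field
    def __init__(self):
        self.max_tokens = 1024

async def llm_call(llm, prompt):
    params = effective_params(llm)
    await asyncio.sleep(0.01)  # simulate the network round-trip
    return params["max_tokens"]

async def request(llm, max_tokens):
    with llm_params(llm, max_tokens=max_tokens):
        return await llm_call(llm, "hi")

async def main():
    llm = LLM()
    return await asyncio.gather(request(llm, 10), request(llm, 500))

results = asyncio.run(main())
print(results)  # [10, 500] — each task sees only its own override
```

`asyncio.gather` creates each coroutine as a task with a copy of the current context, so a `ContextVar.set` inside one request is invisible to the others; the same mechanism also works for threads.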