The previous implementation created and locked threads when acquiring llama_proxy, which can cause thread starvation under many parallel requests.
It also prevents the call to await run_in_threadpool(llama.create_chat_completion, **kwargs) from proceeding, since all worker threads are stuck awaiting the lock and no progress can be made.
This MR converts the acquisition of llama_proxy to an async pattern using asyncio primitives. ExitStack is replaced with AsyncExitStack, and the improper closing of the ExitStack is fixed.
Supersedes previous MR #1795
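A minimal sketch of the pattern this MR moves toward (the names get_llama_proxy and handle_request are illustrative stand-ins, not the actual server code): awaiting an asyncio.Lock suspends the coroutine instead of blocking a worker thread, and AsyncExitStack guarantees the acquired context is closed even on error.

```python
import asyncio
from contextlib import AsyncExitStack, asynccontextmanager

# Hypothetical stand-in for the server's llama_proxy dependency.
@asynccontextmanager
async def get_llama_proxy(lock: asyncio.Lock):
    # Awaiting an asyncio.Lock suspends this coroutine instead of
    # blocking a thread-pool worker, so other requests keep running.
    async with lock:
        yield "llama_proxy"

async def handle_request(lock: asyncio.Lock, results: list):
    # AsyncExitStack closes the entered contexts even if the handler
    # raises, avoiding the improper-close problem of the sync ExitStack.
    async with AsyncExitStack() as stack:
        proxy = await stack.enter_async_context(get_llama_proxy(lock))
        results.append(proxy)

async def main():
    lock = asyncio.Lock()
    results: list = []
    # Many parallel requests serialize on the lock without ever
    # occupying (and starving) the worker thread pool.
    await asyncio.gather(*(handle_request(lock, results) for _ in range(10)))
    return results

results = asyncio.run(main())
print(len(results))  # 10
```

In the real server, the blocking model call would still run via run_in_threadpool; only the lock acquisition becomes async, so worker threads are never parked on the lock.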