abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

fix: Avoid thread starvation on many concurrent requests by making use of asyncio to lock llama_proxy context #1798

Open gjpower opened 1 month ago

gjpower commented 1 month ago

Supersedes previous PR #1795.

The previous implementation acquires llama_proxy by blocking a worker thread on a lock, which can cause thread starvation under many parallel requests. It also prevents the call to `await run_in_threadpool(llama.create_chat_completion, **kwargs)` from proceeding, because all worker threads end up stuck waiting on the lock, so no progress can be made.
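
For illustration, a simplified, hypothetical sketch of that blocking pattern (the names `get_llama_proxy`, `_FakeLlama`, and `handle_request` are stand-ins for illustration, not the project's actual code): each request parks a threadpool worker on a `threading.Lock` while waiting for the shared model.

```python
# Hypothetical sketch of the blocking pattern described above.
import threading
from contextlib import ExitStack, contextmanager

from starlette.concurrency import run_in_threadpool


class _FakeLlama:
    """Stand-in for the shared model object (assumption for illustration)."""

    def create_chat_completion(self, **kwargs):
        return {"choices": []}


_shared_llama = _FakeLlama()
_llama_lock = threading.Lock()


@contextmanager
def get_llama_proxy():
    # Blocks the calling thread until the single shared model is free.
    with _llama_lock:
        yield _shared_llama


async def handle_request(**kwargs):
    exit_stack = ExitStack()
    # enter_context() blocks a threadpool worker while waiting for the lock.
    # With many parallel requests, every worker can end up parked here, so the
    # run_in_threadpool() call below never gets a worker to run on.
    llama = await run_in_threadpool(exit_stack.enter_context, get_llama_proxy())
    result = await run_in_threadpool(llama.create_chat_completion, **kwargs)
    # The stack is only closed on the happy path, mirroring the "improper
    # closing" issue addressed by this PR.
    exit_stack.close()
    return result
```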

This PR adapts the acquisition of llama_proxy to an async pattern that takes advantage of asyncio's synchronization primitives. ExitStack is replaced with AsyncExitStack, and the improper closing of the ExitStack is addressed.
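
A minimal sketch of that async approach, assuming an `asyncio.Lock` guards the shared proxy and an `AsyncExitStack` ensures cleanup even when the handler raises (again using illustrative stand-in names rather than the project's actual API):

```python
# Hypothetical sketch of the async acquisition pattern described above.
import asyncio
from contextlib import AsyncExitStack, asynccontextmanager

from starlette.concurrency import run_in_threadpool


class _FakeLlama:
    """Stand-in for the shared model object (assumption for illustration)."""

    def create_chat_completion(self, **kwargs):
        return {"choices": []}


_shared_llama = _FakeLlama()
_llama_lock = asyncio.Lock()


@asynccontextmanager
async def get_llama_proxy():
    # Awaiting the lock suspends this coroutine on the event loop instead of
    # parking a threadpool worker, so other requests keep making progress.
    async with _llama_lock:
        yield _shared_llama


async def handle_request(**kwargs):
    async with AsyncExitStack() as exit_stack:
        llama = await exit_stack.enter_async_context(get_llama_proxy())
        # Only the actual inference call occupies a threadpool worker; the
        # AsyncExitStack is closed on exit regardless of errors.
        return await run_in_threadpool(llama.create_chat_completion, **kwargs)
```

With this pattern, requests waiting for the model stay on the event loop rather than occupying threadpool workers, so the pool remains available for the inference calls themselves.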