khanjandharaiya opened 9 months ago
I am not sure if your case is similar, but I am facing the same issue: I have a Flask endpoint `/request` in which the user makes the POST request, and I start the server with `app.run(threaded=True)`, which means each new request can be processed in its own thread. I am possibly looking for the same solution; I hope we find one. Thanks for opening the issue 🙏🏻
PS: I was also looking at #771 and https://github.com/abetlen/llama-cpp-python/issues/897 👀
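For context, the setup described above might look like the following minimal sketch. The `/request` route name comes from the comment; the `generate` stub is hypothetical and stands in for an actual llama-cpp-python model call:

```python
# Minimal sketch of the Flask setup described above. The model call is
# stubbed out; in practice `generate` would wrap a Llama(...) completion.
from flask import Flask, jsonify, request

app = Flask(__name__)

def generate(prompt: str) -> str:
    # Placeholder for the real model call, e.g. llm.create_completion(prompt)
    return f"echo: {prompt}"

@app.route("/request", methods=["POST"])
def handle_request():
    prompt = request.get_json(force=True).get("prompt", "")
    return jsonify({"completion": generate(prompt)})

if __name__ == "__main__":
    # threaded=True lets Flask serve each request in its own thread,
    # but the llama-cpp-python model object itself is not thread-safe,
    # which is why concurrent requests can crash or corrupt inference.
    app.run(threaded=True)
```

Note that `threaded=True` only parallelizes the HTTP handling; it does nothing to make the underlying model safe to call from two threads at once.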
This library also provides a server: https://github.com/abetlen/llama-cpp-python/blob/82072802ea0eb68f7f226425e5ea434a3e8e60a0/llama_cpp/server/app.py#L165-L168
That may help.
If the hardware's compute is insufficient, the benefit of parallel inference is low. I implemented simple parallel inference with this project and tested it on a V100S: with 2 concurrent requests, throughput was no better than serving one request at a time.
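The effect described above can be reproduced in miniature: when a single request already saturates the available compute, running two at once tends to stretch each request's latency rather than double throughput. A stdlib-only sketch with a stubbed, compute-bound `infer` (not llama-cpp-python code):

```python
# Sketch: compare total wall time for two requests run serially vs. in
# parallel when the work is compute-bound. `infer` is a stand-in for
# model inference.
import time
from concurrent.futures import ThreadPoolExecutor

def infer(prompt: str) -> str:
    total = 0
    for i in range(1_000_000):  # busy loop standing in for token generation
        total += i
    return f"{prompt}:{total}"

def run_serial(prompts):
    start = time.perf_counter()
    results = [infer(p) for p in prompts]
    return results, time.perf_counter() - start

def run_parallel(prompts):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(infer, prompts))
    return results, time.perf_counter() - start
```

On a machine with no spare compute for the second worker, `run_parallel` finishes in roughly the same total time as `run_serial`, which matches the V100S observation above.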
Supporting parallel inference (batch processing) is a very complex task, involving issues such as the KV cache and logits. Instead, you can use the api_like_OAI.py provided by llama.cpp as an alternative. That server supports parallel inference, although per-request performance is slightly lower when running in parallel.
Hi! I just made such a solution for myself. Here is the code: https://github.com/sergey-zinchenko/llama-cpp-python/tree/model_lock_per_request
I introduced async locking around all model access for every kind of request, streaming and non-streaming. Requests are handled one by one, so it is not truly concurrent, but at least the server will not crash or interrupt the request it is currently handling.
@sergey-zinchenko can you provide more details, like what changes you made and how to adopt them in our llama-cpp-python?
@malik-787 In short, I added a global async lock so requests are handled one by one, and limited the maximum number of waiting requests at the uvicorn level. The server stops crashing and no longer interrupts ongoing inference; in my PR, incoming requests simply wait for the ongoing one to finish. IMHO this approach is better for multi-user scenarios and for k8s deployments.
Hey there!! 🙏
I am currently working on a project that sends requests to the model through a Flask API, and when users send requests concurrently the model cannot handle them. Is there any way I can handle multiple concurrent requests to the model and serve multiple users at the same time?
Please help! @abetlen