abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Concurrent request handling #1062

Open khanjandharaiya opened 9 months ago

khanjandharaiya commented 9 months ago

Hey there!! 🙏

I am currently working on a project that sends requests to the model through a Flask API, and when users send requests concurrently the model cannot handle them. Is there any way I can handle multiple concurrent requests to the model and serve multiple users at the same time?

Please help! @abetlen

AayushSameerShah commented 9 months ago

I am not sure if your case is the same, but I am facing a similar issue:

  1. I created a Flask API endpoint /request to which the user makes a POST request.
  2. The information received from the user goes into the prompt, and the model returns a result.
  3. I set threaded=True in app.run(threaded=True), so each new request is handled in its own thread.
  4. But with 2 concurrent users the server crashes, because it tries to load 2 model instances at once.

I am looking for the same solution; I hope we find one. Thanks for opening the issue 🙏🏻
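For anyone hitting the same crash, a minimal workaround sketch (not an official pattern; the model path and route below are placeholders) is to load a single `Llama` instance at startup and serialize access to it with a lock, so the threaded Flask workers never load or call the model concurrently:

```python
# Sketch only: one shared model instance plus a lock, instead of one model per request.
import threading

from flask import Flask, jsonify, request
from llama_cpp import Llama

app = Flask(__name__)
llm = Llama(model_path="./models/7B/llama-model.gguf")  # placeholder path; loaded once
llm_lock = threading.Lock()  # only one request runs inference at a time

@app.route("/request", methods=["POST"])
def handle_request():
    prompt = request.json["prompt"]
    with llm_lock:  # other threads wait here instead of loading a second model
        result = llm(prompt, max_tokens=256)
    return jsonify(result)

if __name__ == "__main__":
    app.run(threaded=True)
```

Requests are still served one at a time, but the process only ever holds one copy of the model in memory.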


PS: I was also looking at #771 and https://github.com/abetlen/llama-cpp-python/issues/897 👀

This library also provides a server: https://github.com/abetlen/llama-cpp-python/blob/82072802ea0eb68f7f226425e5ea434a3e8e60a0/llama_cpp/server/app.py#L165-L168

That may help.
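If you go that route, the bundled server exposes OpenAI-compatible endpoints, so your Flask code (or any HTTP client) can simply call it. A small sketch, assuming the server's default port 8000 and a placeholder model path and prompt:

```python
# Start the bundled server first (documented invocation; model path is a placeholder):
#   python -m llama_cpp.server --model ./models/7B/llama-model.gguf
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",  # OpenAI-compatible endpoint of llama_cpp.server
    json={"prompt": "Q: Name the planets in the solar system. A: ", "max_tokens": 64},
)
print(resp.json()["choices"][0]["text"])
```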

littlebai3618 commented 8 months ago

If the hardware's compute is insufficient, the benefit of parallel inference is small. I implemented simple parallel inference on top of this project and tested it on a V100S: with 2 concurrent requests, throughput was no better than serving a single request.

Supporting parallel inference (batching) is a very complex task, involving issues such as the KV cache and logits. Instead, you can use the api_like_OAI.py provided by llama.cpp as an alternative. That service supports parallel inference, although per-request performance drops slightly when running in parallel. A rough launch sketch follows.
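For reference, here is roughly what that setup can look like when driven from Python. The binary name, model path, and the -np / -cb flags are assumptions about the llama.cpp build in use and may differ between versions:

```python
# Sketch: run llama.cpp's server example with parallel slots, then front it
# with examples/server/api_like_OAI.py (or any OpenAI-style client).
import subprocess

subprocess.Popen([
    "./server",                            # llama.cpp server binary, built from the llama.cpp repo
    "-m", "./models/7B/llama-model.gguf",  # placeholder model path
    "-c", "4096",                          # total context, shared across slots
    "-np", "2",                            # number of parallel slots (assumed flag)
    "-cb",                                 # continuous batching (assumed flag)
    "--port", "8080",
])
# api_like_OAI.py from llama.cpp's examples can then proxy OpenAI-style
# requests to http://127.0.0.1:8080.
```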

sergey-zinchenko commented 3 months ago

Hi! I just made such a solution for myself. Here is the code: https://github.com/sergey-zinchenko/llama-cpp-python/tree/model_lock_per_request

I introduced async locking of all the model state for all kinds of requests, streaming and not. Requests are handled one by one, so it isn't truly concurrent, but at least the server no longer crashes or interrupts the request it is currently handling.
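For readers who don't want to pull the branch, the core idea looks roughly like this (a minimal illustration, not the actual code from the linked branch; the model path and endpoint are placeholders):

```python
# One asyncio.Lock guards the shared Llama instance, so requests are served
# strictly one at a time instead of crashing or interrupting each other.
import asyncio

from fastapi import FastAPI
from llama_cpp import Llama

app = FastAPI()
llm = Llama(model_path="./models/7B/llama-model.gguf")  # placeholder path
llm_lock = asyncio.Lock()

@app.post("/v1/completions")
async def completions(body: dict):
    async with llm_lock:  # later requests wait for the ongoing one to finish
        # Run the blocking call in a worker thread so the event loop stays responsive.
        return await asyncio.to_thread(llm, body["prompt"], max_tokens=body.get("max_tokens", 256))
```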

malik-787 commented 3 months ago

@sergey-zinchenko can you provide more details, like what changes you made and how to adopt them in llama-cpp-python?

sergey-zinchenko commented 3 months ago

> @sergey-zinchenko can you provide more details, like what changes you made and how to adopt them in llama-cpp-python?

https://github.com/abetlen/llama-cpp-python/pull/1550

sergey-zinchenko commented 3 months ago

@malik-787 In short, I added a global async lock so requests are handled one by one, and limited the maximum number of waiting requests at the uvicorn level. The server no longer crashes or drops ongoing inference; in my PR, incoming requests simply wait for the ongoing one to finish. IMHO this approach is reasonable for multi-user scenarios and for k8s deployments.
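As a sketch of the second half of that approach (the limit value is arbitrary, and the app stands in for the lock-guarded one sketched earlier in the thread), uvicorn's limit_concurrency setting caps how many requests may be in flight at once; anything beyond the limit is rejected with a 503 instead of piling up behind the lock:

```python
import uvicorn
from fastapi import FastAPI

app = FastAPI()  # in practice, the app with the lock-guarded endpoint sketched above

if __name__ == "__main__":
    # limit_concurrency caps simultaneous connections/requests; excess ones get HTTP 503.
    uvicorn.run(app, host="0.0.0.0", port=8000, limit_concurrency=8)
```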