defenseunicorns / leapfrogai-backend-llama-cpp-python

LeapfrogAI backend using llama-cpp-python
Apache License 2.0

Add the ability to handle multiple users via queuing or concurrency #31

Closed · CollectiveUnicorn closed this 3 months ago

CollectiveUnicorn commented 8 months ago

When receiving multiple requests, instead of allowing them to interleave and produce gibberish, we should queue them. That way every user's request is fulfilled in the order it was received. Or, if possible, attempt to add real concurrency.
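A minimal sketch of the queuing approach, assuming an asyncio-based server and a placeholder `run_inference` coroutine (not the backend's actual API): a single worker drains an `asyncio.Queue` in arrival order, so generations never interleave.

```python
import asyncio

request_queue: asyncio.Queue = asyncio.Queue()


async def run_inference(prompt: str) -> str:
    # Placeholder for the real llama-cpp-python call; the synchronous
    # Llama(...) invocation would be wrapped with asyncio.to_thread here.
    await asyncio.sleep(0)
    return f"completion for: {prompt}"


async def worker() -> None:
    # Single consumer: requests are handled strictly in the order they
    # were enqueued, so no two generations ever run at once.
    while True:
        prompt, future = await request_queue.get()
        try:
            future.set_result(await run_inference(prompt))
        except Exception as exc:
            future.set_exception(exc)
        finally:
            request_queue.task_done()


async def complete(prompt: str) -> str:
    # Each incoming request enqueues its prompt and waits for its turn.
    loop = asyncio.get_running_loop()
    future = loop.create_future()
    await request_queue.put((prompt, future))
    return await future
```

The worker would be started once at server startup, e.g. `asyncio.create_task(worker())`.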

CollectiveUnicorn commented 8 months ago

The VLLM backend has the skeleton of what's needed to accomplish this: https://github.com/defenseunicorns/leapfrogai-backend-vllm, but instead of managing concurrent requests there should be a limiter that only allows one request at a time.
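For the limiter variant, something as small as an `asyncio.Semaphore(1)` around the request handler would do. This is a sketch under the same assumptions as above (a placeholder `run_inference` coroutine), not the vLLM backend's actual request handling:

```python
import asyncio

# Semaphore of size one: a single request is ever in flight; the rest
# wait here instead of interleaving with the active generation.
limiter = asyncio.Semaphore(1)


async def handle_request(prompt: str) -> str:
    async with limiter:
        # Waiters are admitted one at a time as the previous request
        # finishes, which keeps output ordering intact.
        return await run_inference(prompt)
```

Raising the semaphore size later would be the natural path toward real concurrency if the underlying model ever supports it safely.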