The vLLM backend has the skeleton of what's needed to accomplish this: https://github.com/defenseunicorns/leapfrogai-backend-vllm but instead of managing concurrent requests, there should be a limiter that only allows one request at a time.
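A minimal sketch of what that limiter could look like, assuming an asyncio-based handler (`run_vllm_generation` is a hypothetical stand-in for whatever the backend actually calls into vLLM with):

```python
import asyncio

# Hypothetical stand-in for the backend's real inference call;
# the actual handler in leapfrogai-backend-vllm will differ.
async def run_vllm_generation(prompt: str) -> str:
    await asyncio.sleep(0.1)  # placeholder for model inference
    return f"completion for: {prompt}"

# A single lock serializes access to the model: only one generation
# runs at a time, and any other callers wait their turn.
_generation_lock = asyncio.Lock()

async def generate(prompt: str) -> str:
    async with _generation_lock:
        return await run_vllm_generation(prompt)
```

Waiters on an `asyncio.Lock` are woken in FIFO order, so this already gives first-come-first-served behavior without an explicit queue.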
When receiving multiple requests, instead of allowing them to interleave and produce gibberish, we should queue them. That way all users' requests can be fulfilled in the order they are received. Or, if possible, attempt to add real concurrency. A sketch of an explicit queue is below.
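If we want the queueing to be explicit (e.g. to report queue depth or cap the backlog later), something along these lines could work. Again, `run_vllm_generation` and `SerialInferenceQueue` are hypothetical names, not anything in the current repo:

```python
import asyncio

async def run_vllm_generation(prompt: str) -> str:
    # Hypothetical stand-in for the backend's real inference call.
    await asyncio.sleep(0.1)
    return f"completion for: {prompt}"

class SerialInferenceQueue:
    """Processes generation jobs one at a time, in the order they arrive."""

    def __init__(self) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()

    async def worker(self) -> None:
        # Single consumer: pull jobs off the queue one at a time so
        # outputs from different requests never interleave.
        while True:
            prompt, future = await self._queue.get()
            try:
                future.set_result(await run_vllm_generation(prompt))
            except Exception as exc:  # surface failures to the waiting caller
                future.set_exception(exc)
            finally:
                self._queue.task_done()

    async def generate(self, prompt: str) -> str:
        # Enqueue the request and wait for the worker to reach it (FIFO).
        future: asyncio.Future = asyncio.get_running_loop().create_future()
        await self._queue.put((prompt, future))
        return await future

async def main() -> None:
    q = SerialInferenceQueue()
    worker_task = asyncio.create_task(q.worker())
    # Three "simultaneous" requests are answered strictly in submission order.
    print(await asyncio.gather(*(q.generate(f"prompt {i}") for i in range(3))))
    worker_task.cancel()

if __name__ == "__main__":
    asyncio.run(main())
```

The lock approach is less code; the explicit queue makes it easier to add queue metrics or a maximum backlog later. Longer term, real concurrency might come from vLLM's own async engine and continuous batching rather than anything in this wrapper.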