Hello, I came across this project while searching for OpenAI-API-compatible servers for llama.cpp, and I was wondering whether it can handle multiple requests at once.
Loading another copy of the model into RAM for each concurrent user doesn't seem like a great idea, so I'm curious whether concurrency is even possible with this project.
Yes, this project does handle multiple requests: the model is loaded once, and an inference session is spawned for each request. :) Support for request batching is coming soon. :)
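To make the shared-model / per-request-session idea concrete, here is a minimal Python sketch. All names here (`SharedModel`, `Session`, `handle_request`, the model path) are hypothetical placeholders, not this project's actual API; the point is only that the weights sit in RAM once while each request carries its own lightweight state:

```python
import threading

# Hypothetical names for illustration only -- the real project's API differs.
# Key idea: the (large) model weights are loaded into RAM exactly once, and
# each incoming request gets its own small inference session (KV cache,
# sampling state) that borrows the shared weights.

class SharedModel:
    """Loaded once at server startup; read-only afterwards, so it is safe
    to share across request threads."""
    def __init__(self, path: str):
        self.path = path  # stand-in for the multi-GB weight tensors

class Session:
    """Per-request state: cheap to create compared to the model itself."""
    def __init__(self, model: SharedModel):
        self.model = model  # a reference, not a copy -- no extra RAM per user
        self.kv_cache = []  # each request keeps its own context

    def generate(self, prompt: str) -> str:
        return f"completion for {prompt!r}"  # placeholder for real inference

MODEL = SharedModel("/models/example.gguf")  # hypothetical path

def handle_request(prompt: str) -> str:
    session = Session(MODEL)  # spawned per request
    return session.generate(prompt)

# Two concurrent requests share a single copy of the weights:
threads = [threading.Thread(target=handle_request, args=(p,))
           for p in ("hello", "world")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Batching, when it lands, would go one step further: instead of each session running its forward passes independently, concurrent requests are grouped into a single pass over the shared model.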
Thank you for your work!