Open mcharytoniuk opened 6 months ago
Hi @mcharytoniuk, thanks for this interesting project! We use a combination of llama.cpp server and Ollama, both running in Docker, and have implemented our own Python-based proxy/LB. We are looking to move to something specialized like Paddler. Can we do this today with Paddler?
@aiseei Thank you for reaching out!
You can absolutely use Paddler with your llama.cpp setup in production. Personally, I am using it with Auto Scaling groups with llama.cpp.
When it comes to Ollama, not at the moment.
The issue is that Ollama potentially starts and manages multiple llama.cpp servers internally on its own and does not expose some llama.cpp internal endpoints (like /health: https://github.com/ollama/ollama/issues/1378) and statuses; currently, it does not allow hooking into some llama.cpp APIs that Paddler requires to function.
I might try to get it to work for just OpenAI-like endpoints if there is some interest in having Ollama integration, though. However, that would have some limitations compared to balancing based on slots (slots let us predict the maximum number of requests a server can handle at once, which allows predictable buffering). Do you think that would be OK for your use case?
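To illustrate why slot counts make buffering predictable, here is a minimal sketch in Python. It is not Paddler's actual implementation; `Upstream`, `SlotBalancer`, and all of their fields are hypothetical names made up for this example. It assumes each server reports a fixed number of slots and that the balancer is told when a request finishes.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Upstream:
    # Hypothetical record for one llama.cpp server instance.
    name: str
    slots_total: int
    slots_busy: int = 0

    @property
    def slots_idle(self) -> int:
        return self.slots_total - self.slots_busy


@dataclass
class SlotBalancer:
    # Assumes a non-empty list of upstreams.
    upstreams: list
    pending: deque = field(default_factory=deque)

    def capacity(self) -> int:
        # Total number of requests that can still be accepted right now.
        return sum(u.slots_idle for u in self.upstreams)

    def dispatch(self, request_id):
        # Pick the upstream with the most idle slots.
        best = max(self.upstreams, key=lambda u: u.slots_idle)
        if best.slots_idle == 0:
            # No free slot anywhere: buffer instead of overloading a server.
            self.pending.append(request_id)
            return None
        best.slots_busy += 1
        return best.name

    def release(self, name):
        # Free one slot; if a request is buffered, hand the slot to it.
        upstream = next(u for u in self.upstreams if u.name == name)
        upstream.slots_busy -= 1
        if self.pending:
            request_id = self.pending.popleft()
            upstream.slots_busy += 1
            return request_id, name
        return None
```

Because the balancer always knows the total number of idle slots, it can decide up front whether a request runs immediately or waits in the buffer. With only OpenAI-style endpoints and no slot information, that number would have to be guessed or configured by hand.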
@aiseei I think I have a few ideas on how to handle the issue. I will add support for Ollama and other OpenAI-style APIs to Paddler. See also: https://github.com/distantmagic/paddler/issues/18
@mcharytoniuk Hi, sorry for the late reply. Yes, supporting the OpenAI API style would work. By the way, I came across this issue today: https://github.com/ollama/ollama/issues/6492. It might be relevant as you support Ollama.
Bringing up issues and news like that helps me maintain the package; it makes it easier for me to follow what is relevant in the ecosystem. Thank you!
llama.cpp exposes the /health endpoint, which makes it easy to deal with slots. What about other similar solutions?