distantmagic / paddler

Stateful load balancer custom-tailored for llama.cpp
MIT License
515 stars, 22 forks

Explore integration options with Ollama and other backends #6

Open mcharytoniuk opened 3 months ago

mcharytoniuk commented 3 months ago

llama.cpp exposes the /health endpoint, which makes it easy to deal with slots. What about other similar solutions?

aiseei commented 3 weeks ago

Hi @mcharytoniuk, thanks for this interesting project! We use a combination of the llama.cpp server and Ollama, both running in Docker, and have implemented our own Python-based proxy/LB. We are looking to move to something specialized like Paddler. Can we do this today with Paddler?

mcharytoniuk commented 3 weeks ago

@aiseei Thank you for reaching out!

You can absolutely use Paddler with your llama.cpp setup in production. Personally, I am using it with Auto Scaling groups with llama.cpp.

When it comes to Ollama, not at the moment.

The issue is that Ollama may start and manage multiple llama.cpp servers internally on its own, and it does not expose some llama.cpp internal endpoints and statuses (like /health: https://github.com/ollama/ollama/issues/1378). Currently, it does not allow hooking into some of the llama.cpp APIs that Paddler requires to function.

I might try to get it to work for just OpenAI-like endpoints if there is some interest in having Ollama integration, though. However, that would have some limitations compared to balancing based on slots (slots let us predict the maximum number of requests a server can handle, which allows predictable buffering). Do you think that would be OK for your use case?
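To illustrate what balancing based on slots means here, the following is a minimal sketch and not Paddler's actual implementation. It assumes each backend's health payload carries `slots_idle` and `slots_processing` counters, as some llama.cpp server versions have reported from /health. The names `Backend`, `parse_health`, `pick_backend`, and `total_idle_slots` are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Backend:
    name: str
    slots_idle: int        # assumed field from a /health-style payload
    slots_processing: int  # assumed field from a /health-style payload


def parse_health(name: str, payload: dict) -> Backend:
    """Build a Backend from an already-decoded health payload (assumed shape).

    Missing counters are treated as 0, so a backend that does not report
    slots is never picked.
    """
    return Backend(
        name=name,
        slots_idle=int(payload.get("slots_idle", 0)),
        slots_processing=int(payload.get("slots_processing", 0)),
    )


def pick_backend(backends: List[Backend]) -> Optional[Backend]:
    """Return the backend with the most idle slots, or None if none are free."""
    available = [b for b in backends if b.slots_idle > 0]
    if not available:
        # No free slot anywhere: the caller should buffer the request.
        return None
    return max(available, key=lambda b: b.slots_idle)


def total_idle_slots(backends: List[Backend]) -> int:
    """Upper bound on how many requests can be dispatched right now."""
    return sum(b.slots_idle for b in backends)
```

With slot counts available, the balancer knows how many requests it can forward immediately and how many it has to hold back. A backend that only exposes OpenAI-style endpoints provides no such counters, so `pick_backend` would have nothing to rank on.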

mcharytoniuk commented 3 weeks ago

@aiseei I think I have a few ideas on how to handle the issue. I will add support for Ollama and other OpenAI-style APIs to Paddler. See also: https://github.com/distantmagic/paddler/issues/18

aiseei commented 1 week ago

@mcharytoniuk Hi, sorry for the late reply. Yes, supporting the OpenAI API style would work. By the way, I came across this issue today: https://github.com/ollama/ollama/issues/6492. It might be relevant as you add Ollama support.

mcharytoniuk commented 1 week ago

Bringing up issues and news like that helps me maintain the package; it makes it easier for me to follow what is relevant in the ecosystem. Thank you!