distantmagic / paddler

Stateful load balancer custom-tailored for llama.cpp 🏓🦙
MIT License
630 stars, 27 forks

Sticky Sessions #10

Open mcharytoniuk opened 3 months ago

mcharytoniuk commented 3 months ago

It should be possible to always direct requests to a specific slot and distribute them among all the observed servers.

Once a request is issued, all subsequent requests from the same client should land in that same slot.

It can be implemented with a cookie.
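A minimal sketch of the cookie idea, written in Python purely for illustration. It is not Paddler's code: `Slot`, `StickyBalancer`, and the `server/slot_id` token format are hypothetical names, and the sketch assumes the cookie value has already been extracted from the request. A request without a known token gets the next slot in round-robin order; the caller would then set the returned slot's token as the cookie so that later requests return to the same slot.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Iterable, Optional


@dataclass(frozen=True)
class Slot:
    """One slot on one observed llama.cpp server (hypothetical model)."""
    server: str   # upstream address, e.g. "10.0.0.1:8080"
    slot_id: int  # slot index on that server

    def token(self) -> str:
        # Value that would be stored in the sticky-session cookie (assumed format).
        return f"{self.server}/{self.slot_id}"


class StickyBalancer:
    """Routes by cookie token when possible, round-robin otherwise."""

    def __init__(self, slots: Iterable[Slot]) -> None:
        slots = list(slots)  # assumed non-empty
        self._by_token = {slot.token(): slot for slot in slots}
        self._round_robin = cycle(slots)

    def route(self, cookie_value: Optional[str]) -> Slot:
        # A known token pins the request to its slot.
        slot = self._by_token.get(cookie_value) if cookie_value else None
        if slot is None:
            # No cookie, or a stale/unknown one: pick the next slot in turn.
            slot = next(self._round_robin)
        return slot
```

A stale token (for example, after a server disappears) simply falls back to round-robin here; whether that is the right behavior for Paddler is an open design question.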

See also:

VJHack commented 2 months ago

I can take a crack at this in the next few days. @mcharytoniuk, I just had a couple questions about this:

  1. I'm a bit confused what this has to do with adding control vectors in the server. What is the relationship between control vectors and sticky sessions?
  2. What benefit would sticky sessions really provide? The slots are released after completion so prompts are processed independently and are stateless. Is this supposed to be some kind of performance optimization?

mcharytoniuk commented 2 months ago

> I can take a crack at this in the next few days.

That would be awesome, @VJHack. :)

> 1. I'm a bit confused what this has to do with adding control vectors in the server. What is the relationship between control vectors and sticky sessions?
> 2. What benefit would sticky sessions really provide? The slots are released after completion so prompts are processed independently and are stateless. Is this supposed to be some kind of performance optimization?

According to that llama.cpp PR about control vectors, once it is merged into llama.cpp it should be possible to configure each slot with a different control vector. There can be a scenario where a user configures several llama.cpp servers in the same way, with a specific control vector at the same slot number on each of them.

Then, when a specific cookie (or something similar) is present in the request, that request can be balanced only between slots with that specific control vector.

Overall, I have been thinking about either using cookies or creating some Paddler-specific endpoints that allow tagging specific slots, and then issuing requests to slots that are configured with that specific tag.
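The tagging variant could look roughly like the sketch below, again in Python for illustration only. `TaggedBalancer`, `tag_slot`, and `route` are hypothetical names, not existing Paddler endpoints or APIs. The sketch assumes each slot carries at most one tag (for example, the name of its control vector) and that a request names the tag it wants; the request is then balanced round-robin among only the slots with that tag.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Slot:
    """One slot on one observed llama.cpp server (hypothetical model)."""
    server: str
    slot_id: int


class TaggedBalancer:
    """Balances requests only among slots that carry the requested tag."""

    def __init__(self) -> None:
        self._tags: Dict[Slot, str] = {}                  # slot -> tag
        self._cursor: Dict[str, int] = defaultdict(int)   # tag -> round-robin position

    def tag_slot(self, slot: Slot, tag: str) -> None:
        # What a hypothetical "tag this slot" endpoint would do.
        self._tags[slot] = tag

    def route(self, tag: str) -> Optional[Slot]:
        # Candidates keep the order in which they were tagged.
        candidates = [slot for slot, t in self._tags.items() if t == tag]
        if not candidates:
            return None  # no slot is configured with this tag
        index = self._cursor[tag] % len(candidates)
        self._cursor[tag] += 1
        return candidates[index]
```

For example, tagging slot 0 on two servers as `"formal"` and then calling `route("formal")` repeatedly would alternate between those two slots and never touch slots with other tags.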

VJHack commented 2 months ago

Sorry, I'd love to work on this but I'm quite busy at the moment. If someone else wants to take it on, they can. I'll come back and revisit it if it's still open later.