Azure / aoai-smart-loadbalancing

Smart load balancing for Azure OpenAI endpoints
MIT License

Handling throttling(429) at deployment(s) level under a single Instance #7

Open jayendranarumugam opened 10 months ago

jayendranarumugam commented 10 months ago

Currently, we are handling the 429s at the endpoint level (skipping the deployments). However, those TPM/RPM limits are defined at the deployment level.

We can have multiple deployments on a single instance, like gpt-3.5, gpt-4, etc. In this scenario, how can we handle the 429?

The current logic can only handle a single deployment: if that deployment (let's assume gpt-3.5-turbo) is returning 429, the endpoint can be marked as throttling. But if there are multiple deployments, we cannot simply mark the whole endpoint as throttling, since it may still be capable of handling the other deployments (like gpt-4, gpt-4-turbo, etc.).

How can we serve such multiple deployments under a single instance?

andredewes commented 10 months ago

You can do that by specifying the full deployment path in the BACKEND_X_URL setting. For example:
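Something along these lines; the instance and deployment names here are placeholders rather than values from this issue:

```
# Each backend URL carries the full deployment path
BACKEND_1_URL=https://instance-1.openai.azure.com/openai/deployments/gpt-35-turbo
BACKEND_2_URL=https://instance-2.openai.azure.com/openai/deployments/gpt-35-turbo
```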

Then the load balancer will forward that full path to the chosen backend. However, from your client side you need to remove the path you added to the backend URL when sending requests; see the sketch below for a before/after example.
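For illustration (the deployment name and api-version below are placeholders, not values from this issue):

```
# Before: the client addresses the deployment explicitly
POST https://<load-balancer>/openai/deployments/gpt-35-turbo/chat/completions?api-version=2024-02-01

# After: the deployment path now lives in BACKEND_X_URL, so the client drops it
POST https://<load-balancer>/chat/completions?api-version=2024-02-01
```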

Otherwise, that part of the path would be duplicated. Let me know if this works for you or not. Some client-side SDKs might automatically add parts of the URL, which can remove your flexibility to send whatever path you want... let me know if that's your case and we can work on another feature to facilitate it!

jayendranarumugam commented 10 months ago

Thanks @andredewes for the quick reply. While this will allow us to hit the deployment-specific route, I cannot scale this solution to support multiple deployments across different instances (grouping all the instances by deployment).

Let me put my use case here:

I have two instances of Azure OpenAI (Instance-1 and Instance-2).

Both of these instances have two deployments: gpt35turbodeployment and gpt4deployment.

The assumption here is that, for a given deployment, the name is the same across all the instances (see the sketch below).
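A hedged illustration of the unit that actually throttles, using the names above:

```
Instance-1 / gpt35turbodeployment   -> has its own TPM/RPM limits, returns its own 429s
Instance-1 / gpt4deployment         -> has its own TPM/RPM limits, returns its own 429s
Instance-2 / gpt35turbodeployment   -> has its own TPM/RPM limits, returns its own 429s
Instance-2 / gpt4deployment         -> has its own TPM/RPM limits, returns its own 429s
```

So marking a whole instance as unhealthy because one deployment returned 429 would needlessly take the other deployment out of rotation.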

andredewes commented 10 months ago

I think I understand what you're trying to say. You don't want to mix GPT35 and GPT4 from your application's perspective; the applications are already specific about their desired model within the request. You don't want an app sending a /gpt4 path to end up in a /gpt35 backend.

In this scenario, wouldn't it make more sense to deploy two instances of the load balancer, one for your GPT3.5 endpoints and the other for the GPT4 applications?
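A minimal sketch of that idea, reusing the deployment names above (the endpoints are placeholders) and combining it with the deployment-path trick from the earlier comment; each block is the environment configuration of a separately deployed load balancer container:

```
# Load balancer A: balances gpt35turbodeployment across both instances
BACKEND_1_URL=https://instance-1.openai.azure.com/openai/deployments/gpt35turbodeployment
BACKEND_2_URL=https://instance-2.openai.azure.com/openai/deployments/gpt35turbodeployment

# Load balancer B: balances gpt4deployment across both instances
BACKEND_1_URL=https://instance-1.openai.azure.com/openai/deployments/gpt4deployment
BACKEND_2_URL=https://instance-2.openai.azure.com/openai/deployments/gpt4deployment
```

Each app then points at the load balancer that matches its model, so a 429 from one deployment never affects routing for the other.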

jayendranarumugam commented 10 months ago

> I think I understand what you're trying to say. You don't want to mix GPT35 and GPT4 from your application's perspective; the applications are already specific about their desired model within the request. You don't want an app sending a /gpt4 path to end up in a /gpt35 backend.

Exactly

> In this scenario, wouldn't it make more sense to deploy two instances of the load balancer, one for your GPT3.5 endpoints and the other for the GPT4 applications?

Isn't that too much infra/cost to handle? Also, looking at the enterprise level, the number of models will only grow in the future, so is adding one more load balancer per model really a good design from a scalability standpoint? Since YARP already has the capability of adding more clusters and destinations with multiple routes, can we use that capability to implement this within a single load balancer?
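For illustration only, a multi-cluster YARP configuration along these lines could keep both deployments behind one proxy by routing on the deployment path; this is the standard YARP appsettings.json shape, not something this repository ships today:

```json
{
  "ReverseProxy": {
    "Routes": {
      "gpt35-route": {
        "ClusterId": "gpt35-cluster",
        "Match": { "Path": "/openai/deployments/gpt35turbodeployment/{**remainder}" }
      },
      "gpt4-route": {
        "ClusterId": "gpt4-cluster",
        "Match": { "Path": "/openai/deployments/gpt4deployment/{**remainder}" }
      }
    },
    "Clusters": {
      "gpt35-cluster": {
        "Destinations": {
          "instance-1": { "Address": "https://instance-1.openai.azure.com/" },
          "instance-2": { "Address": "https://instance-2.openai.azure.com/" }
        }
      },
      "gpt4-cluster": {
        "Destinations": {
          "instance-1": { "Address": "https://instance-1.openai.azure.com/" },
          "instance-2": { "Address": "https://instance-2.openai.azure.com/" }
        }
      }
    }
  }
}
```

Each cluster then tracks health (and therefore 429 throttling state) for its own deployment independently.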

andredewes commented 10 months ago

Good question. I think this is a balance of ease of use vs. how complex the code and configuration can be. One of the drawbacks of this solution is that YARP still doesn't support retries natively (check here: https://github.com/microsoft/reverse-proxy/issues/56), but once it does, we will need to reevaluate this code completely, and any YARP-style configuration will become much more straightforward to implement.

Now, coming back to your concern about capacity: if you check the memory consumption of this container, you will see it stays around 40-60 MB after its initial startup, and it goes up and down depending on how much traffic you have. That's still an exceptionally low and acceptable consumption IMO.

Can we revisit this topic once YARP implements HTTP retries natively, and keep it simple for now?

jayendranarumugam commented 10 months ago

> One of the drawbacks of this solution is that YARP still doesn't support retries natively (check here: https://github.com/microsoft/reverse-proxy/issues/56)

Thanks for this. I believe this is why you implemented the passive mode with the custom ThrottlingHealthPolicy? Can't that be scaled to a multi-cluster YARP design?

andredewes commented 10 months ago

The custom ThrottlingHealthPolicy is needed because we want the passive health checker to set the backend to "unhealthy" only for the time specified in the Retry-After HTTP response header. This logic is not built into any standard YARP policy.
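As a rough sketch only (not the repository's exact code, and note that the IPassiveHealthCheckPolicy signature has changed slightly between YARP versions), such a policy looks roughly like this:

```csharp
using System;
using Microsoft.AspNetCore.Http;
using Yarp.ReverseProxy.Health;
using Yarp.ReverseProxy.Model;

// Sketch of a Retry-After-aware passive health policy (assumed shape, newer YARP API).
public class ThrottlingHealthPolicy : IPassiveHealthCheckPolicy
{
    public const string ThrottlingPolicyName = "ThrottlingPolicy";

    private readonly IDestinationHealthUpdater _healthUpdater;

    public ThrottlingHealthPolicy(IDestinationHealthUpdater healthUpdater)
    {
        _healthUpdater = healthUpdater;
    }

    public string Name => ThrottlingPolicyName;

    public void RequestProxied(HttpContext context, ClusterState cluster, DestinationState destination)
    {
        if (context.Response.StatusCode != StatusCodes.Status429TooManyRequests)
        {
            return; // only throttling responses affect passive health here
        }

        // Fall back to a fixed cool-down if the backend didn't say how long to wait.
        var retryAfter = TimeSpan.FromSeconds(30);
        if (int.TryParse(context.Response.Headers["Retry-After"], out var seconds))
        {
            retryAfter = TimeSpan.FromSeconds(seconds);
        }

        // Mark only this destination unhealthy, and only for the Retry-After window;
        // YARP's reactivation scheduler flips it back to healthy afterwards.
        _healthUpdater.SetPassive(cluster, destination, DestinationHealth.Unhealthy, retryAfter);
    }
}
```

The policy would be registered as an IPassiveHealthCheckPolicy singleton and referenced by name in each cluster's passive health check configuration, which is also what would let it extend naturally to a multi-cluster setup.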

And yes, it is possible to scale to multi-cluster. This is becoming a more common requirement lately and we're planning to implement this in the coming months.
