Azure / api-management-policy-snippets

Re-usable examples of Azure API Management policies

Azure OpenAI + random load balancer policy #110

Closed: ericthomas1 closed this issue 11 months ago

ericthomas1 commented 11 months ago

Hello,

To set up the Random Load Balancer policy, do I need to name the Azure OpenAI Deployed Model the same in each backend?

Example:
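Something along the lines of the repo's random load balancer sample, pointed at two Azure OpenAI backends (a minimal sketch; the hostnames are hypothetical):

<set-variable name="backendId" value="@(new Random(context.RequestId.GetHashCode()).Next(1, 3))" />
<choose>
    <!-- Route roughly half of the requests to each Azure OpenAI backend -->
    <when condition="@(context.Variables.GetValueOrDefault<int>("backendId") == 1)">
        <set-backend-service base-url="https://aoai-eastus.openai.azure.com/openai" />
    </when>
    <when condition="@(context.Variables.GetValueOrDefault<int>("backendId") == 2)">
        <set-backend-service base-url="https://aoai-westus.openai.azure.com/openai" />
    </when>
</choose>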

The issue is the <deployment-name> segment of the request URL. If I set up the Random Load Balancer policy above with two different Azure OpenAI APIM backends, that segment differs between them, so a random portion of requests fail.
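For instance, with hypothetical hostnames and mismatched deployment names, the same request path exists on only one of the two backends:

https://aoai-eastus.openai.azure.com/openai/deployments/gpt-35-turbo/chat/completions?api-version=2023-05-15
https://aoai-westus.openai.azure.com/openai/deployments/gpt35/chat/completions?api-version=2023-05-15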

Hope this makes sense.

Thank you

ericthomas1 commented 11 months ago

Thoughts on this one?

ericthomas1 commented 11 months ago

The answer is "yes". From what I found, to use multiple Azure OpenAI Resources for random load balancing, the Deployed Model names need to be identical.

Example:
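For instance, with the same hypothetical hostnames as above, both resources have a deployment named gpt-35-turbo, so either backend can serve the same request path:

https://aoai-eastus.openai.azure.com/openai/deployments/gpt-35-turbo/chat/completions?api-version=2023-05-15
https://aoai-westus.openai.azure.com/openai/deployments/gpt-35-turbo/chat/completions?api-version=2023-05-15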

If one model is named differently, the random load balancer fails intermittently: a request errors whenever the randomly selected Azure OpenAI resource's Deployed Model name doesn't match the deployment name in the request URL.

ishaan-jaff commented 7 months ago

@ericthomas1 I'm the maintainer of litellm (https://github.com/BerriAI/litellm); we make it easy to load balance across 100+ LLMs and providers. I'd love to know if this solves your problem, and to hear your feedback if something is missing.

Here's how it works:

Router - load balancing (Docs)

LiteLLM allows you to load balance between multiple deployments (Azure, OpenAI). It picks the deployment that is below its rate limit and has used the fewest tokens.

import os

from litellm import Router

model_list = [{ # list of model deployments 
    "model_name": "gpt-3.5-turbo", # model alias 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "azure/chatgpt-v-2", # actual model name
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo", 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "azure/chatgpt-functioncalling", 
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo", 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "gpt-3.5-turbo", 
        "api_key": os.getenv("OPENAI_API_KEY"),
    }
}]

router = Router(model_list=model_list)

# openai.ChatCompletion.create replacement
response = router.completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hey, how's it going?"}],
)

print(response)