Closed by iMicknl 5 months ago
Would be good to make this configurable. Perhaps via JSON values that we inject to the AppConfiguration?
| Key | Value | Content Type |
|---|---|---|
| AzureOpenAIEndpoints | `{}` | application/json |
The JSON value could look like this:

```jsonc
[
  {
    "deployment-name": "gpt-35-turbo",
    "distribution-strategy": "priority", // or "random", "round-robin"
    "endpoints": [
      {
        "endpoint": "https://cog-d7knihn7w73zw-swedencentral.openai.azure.com",
        "deployment_name": "gpt-35-turbo-ptu", // optional (if different name)
        "priority": 1
      },
      {
        "endpoint": "https://cog-d7knihn7w73zw-swedencentral.openai.azure.com",
        "priority": 2
      }
    ]
  },
  {
    "deployment-name": "text-embeddings-ada-002",
    "distribution-strategy": "round-robin", // or "random", "priority"
    "endpoints": [
      {
        "endpoint": "https://cog-d7knihn7w73zw-swedencentral.openai.azure.com"
      },
      {
        "endpoint": "https://cog-d7knihn7w73zw-swedencentral.openai.azure.com",
        "deployment_name": "text-embeddings-ada" // optional (if different name)
      }
    ]
  }
]
```
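To make the idea concrete, here is a minimal sketch of how a proxy could resolve an endpoint for a deployment from a config shaped like the JSON above. All names (`pick_endpoint`, the example endpoints) are illustrative assumptions, not code from this repo:

```python
import itertools
import random

# Hypothetical in-memory copy of the AzureOpenAIEndpoints config value.
CONFIG = [
    {
        "deployment-name": "gpt-35-turbo",
        "distribution-strategy": "priority",
        "endpoints": [
            {"endpoint": "https://a.openai.azure.com", "priority": 1},
            {"endpoint": "https://b.openai.azure.com", "priority": 2},
        ],
    }
]

_round_robin = {}  # deployment name -> cycling iterator, kept across calls


def pick_endpoint(deployment: str, unavailable: frozenset = frozenset()) -> dict:
    """Pick the next endpoint for a deployment per its configured strategy."""
    route = next(r for r in CONFIG if r["deployment-name"] == deployment)
    candidates = [e for e in route["endpoints"] if e["endpoint"] not in unavailable]
    strategy = route["distribution-strategy"]
    if strategy == "priority":
        # Lowest priority number wins; ties are broken randomly.
        best = min(e.get("priority", 1) for e in candidates)
        return random.choice([e for e in candidates if e.get("priority", 1) == best])
    if strategy == "round-robin":
        cycler = _round_robin.setdefault(deployment, itertools.cycle(route["endpoints"]))
        return next(cycler)
    return random.choice(candidates)  # "random"


print(pick_endpoint("gpt-35-turbo")["endpoint"])  # https://a.openai.azure.com
```

The `unavailable` set is one possible way to exclude endpoints that recently returned a rate-limit error.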
Other repositories use YAML for this, e.g. https://github.com/timoklimmer/powerproxy-aoai/blob/main/config/config.example.yaml, which would allow us to load a more advanced config. We could even pull the list of available models from the API, since you might want a more advanced workflow.
I came to this implementation based on this article and repo: https://techcommunity.microsoft.com/t5/fasttrack-for-azure/smart-load-balancing-for-openai-endpoints-using-containers/ba-p/4017550
```json
{
  "routes": [
    {
      "name": "gpt-35-turbo",
      "endpoints": [
        {
          "address": "https://primaryinstance.openai.azure.com/",
          "priority": 1
        },
        {
          "address": "https://secondaryinstance.openai.azure.com/",
          "priority": 2
        }
      ]
    },
    {
      "name": "text-embedding-ada-002",
      "endpoints": [
        {
          "address": "https://primaryinstance.openai.azure.com/",
          "priority": 1
        },
        {
          "address": "https://secondaryinstance.openai.azure.com/",
          "priority": 1
        }
      ]
    },
    {
      "name": "gpt-35-turbo-withpolicy",
      "endpoints": [
        {
          "address": "https://primaryinstance.openai.azure.com/",
          "priority": 1
        }
      ]
    }
  ]
}
```
The repo will provision the first two routes as part of the deployment. "gpt-35-turbo-withpolicy" is just an example and is not part of the deployment (yet).
- Several endpoints within a route can have the same priority -> a backend will be picked at random.
- Several endpoints within a route can have different priorities -> the backend will favor priority 1 over priority 2, but will fall back to the priority 2 endpoint(s) when hitting a rate limit on a priority 1 endpoint.
- A route always corresponds to a unique deployment name within the config. The endpoint should have that deployment available (the proxy will not check, just forward). The reason for this is that Azure OpenAI uses the deployment name in the URL.
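The priority-with-fallback behavior described above can be sketched as follows. This is an illustrative outline, not the actual proxy code; `send` is a stand-in for whatever forwards the request to Azure OpenAI:

```python
import random


def forward(route: dict, send) -> str:
    """Try endpoints tier by tier: lowest priority number first,
    random order within a tier, falling back on HTTP 429."""
    priorities = sorted({e["priority"] for e in route["endpoints"]})
    for prio in priorities:
        tier = [e for e in route["endpoints"] if e["priority"] == prio]
        random.shuffle(tier)  # same priority -> pick at random
        for endpoint in tier:
            status, body = send(endpoint["address"])
            if status != 429:  # not rate-limited, return this response
                return body
    raise RuntimeError("all endpoints are rate-limited")


# Usage with a stubbed sender where the primary instance is rate-limited:
route = {
    "endpoints": [
        {"address": "https://primary", "priority": 1},
        {"address": "https://secondary", "priority": 2},
    ]
}


def send(addr):
    return (429, "") if addr == "https://primary" else (200, addr)


print(forward(route, send))  # https://secondary
```

A real implementation would also want to honor the `Retry-After` header before retrying a rate-limited endpoint, as the linked smart load-balancing article does.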
The feature/smartlb PR should implement this.
Currently, a round-robin strategy is used for the AI Proxy. It would be great to have multiple options here.
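One way to support multiple options is a small strategy registry, so the proxy looks up the configured `distribution-strategy` by name. A sketch under assumed names (nothing here is from the current proxy code):

```python
import itertools
import random

# Each strategy is a factory: given a route's endpoints, it returns a
# zero-argument callable that yields the next endpoint to try.

def round_robin(endpoints):
    cycler = itertools.cycle(endpoints)
    return lambda: next(cycler)


def random_choice(endpoints):
    return lambda: random.choice(endpoints)


def priority(endpoints):
    # Keep only the best (lowest-numbered) tier; ties are broken randomly.
    best = min(e.get("priority", 1) for e in endpoints)
    tier = [e for e in endpoints if e.get("priority", 1) == best]
    return lambda: random.choice(tier)


STRATEGIES = {"round-robin": round_robin, "random": random_choice, "priority": priority}

eps = [{"endpoint": "a"}, {"endpoint": "b"}]
nxt = STRATEGIES["round-robin"](eps)
print(nxt()["endpoint"], nxt()["endpoint"], nxt()["endpoint"])  # a b a
```

Adding a new strategy then only means registering one more factory, without touching the request path.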