Azure-Samples / azure-search-openai-demo

A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.
https://azure.microsoft.com/products/search
MIT License

Add OpenAI Priority Load Balancer for Azure OpenAI #1626

Open simonkurtz-MSFT opened 4 months ago

simonkurtz-MSFT commented 4 months ago

This PR introduces the openai-priority-loadbalancer as a native Python option to target one or more Azure OpenAI endpoints. Among the features of the load-balancer are:

Relevant links:


This PR can be merged after @pamelafox's approval.
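As a rough illustration of the idea behind the package (this is not the actual openai-priority-loadbalancer API; the class, hostnames, and selection helper below are made up for the sketch), priority-based backend selection with failover can be expressed in plain Python: requests go to a random choice among the available backends at the best priority, and fall back to the next priority tier when none remain.

```python
# Hypothetical sketch of priority-based load balancing, NOT the package's API.
import random
from dataclasses import dataclass

@dataclass
class Backend:
    host: str
    priority: int           # lower value = higher priority
    available: bool = True  # would be flipped off on 429/5xx until a cooldown expires

def pick_backend(backends: list[Backend]) -> Backend:
    candidates = [b for b in backends if b.available]
    if not candidates:
        raise RuntimeError("no available backends")
    best = min(b.priority for b in candidates)
    # Randomize among same-priority backends; a fixed order would make
    # every worker process hammer the same instance.
    return random.choice([b for b in candidates if b.priority == best])

backends = [
    Backend("aoai-eastus.openai.azure.com", priority=1),
    Backend("aoai-westus.openai.azure.com", priority=1),
    Backend("aoai-overflow.openai.azure.com", priority=2),
]

chosen = pick_backend(backends)    # one of the two priority-1 hosts
backends[0].available = False
backends[1].available = False
fallback = pick_backend(backends)  # only the priority-2 host remains
```

In the real package, this selection would sit behind the OpenAI client's HTTP layer so that application code stays unchanged.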

simonkurtz-MSFT commented 4 months ago

Hi @pamelafox & @kristapratico,

This is how the OpenAI Priority Load Balancer integrates. Never mind the hard-coded backend and the location of the backends list in this PR. I don't intend to ask for a merge, but this was the best way to give you an idea of the setup.

If you have two AOAI instances with the same model, you can plug them both in and should see load-balancing.

simonkurtz-MSFT commented 4 months ago

I brought up two AOAI instances and related assets and configured both instances as backends in app.py. Then I started to have a conversation.


Both backends are responding. It's important to note that the distribution is not uniform, because available backends are selected at random (this randomization is necessary for multi-process workloads).
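That randomized-but-balanced behavior can be checked with a quick stdlib simulation (hypothetical hostnames; this just models random choice among two equal-priority backends):

```python
import random
from collections import Counter

hosts = ["aoai-1.openai.azure.com", "aoai-2.openai.azure.com"]

# Simulate 1,000 requests, each routed to a random available backend.
counts = Counter(random.choice(hosts) for _ in range(1000))

# Both backends receive traffic, but the split is randomized rather than
# a strict round-robin alternation, so the two counts won't be exactly 500/500.
```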


At no point did the conversation break down or show any kind of error in the chat bot.

pamelafox commented 3 months ago

Cool! I made a few changes to the PR to make it a little easier to test out, by actually making the additional backend deployment. Mind if I push them to the branch?

I think we should mention this option in the Productionizing guide, and if there are multiple customers wanting to use this approach, we could consider integrating it into main as an option.

pamelafox commented 3 months ago

Here's what my usage graphs look like during a load test, btw:

simonkurtz-MSFT commented 3 months ago

> Cool! I made a few changes to the PR to make it a little easier to test out, by actually making the additional backend deployment, mind if I push them to the branch?
>
> I think we should mention this option in the Productionizing guide, and if there are multiple customers wanting to use this approach, we could consider integrating it into main as an option.

Hi Pamela, please do push! I very much welcome your expertise and improvements. If there are aspects of the 1.0.9 package itself that should/need to be improved, I'm all ears there, too, of course.

Thank you so much! I know this is an extraordinary amount of time spent.

simonkurtz-MSFT commented 3 months ago

> Here are what my usage graphs look like during a load test btw:

Help me understand your test results, please. Are you hitting different backends or just different models?

pamelafox commented 3 months ago

@simonkurtz-MSFT Those graphs were for two different OpenAI instances in the same region.

pamelafox commented 3 months ago

@simonkurtz-MSFT Could you send a separate PR adding a mention of this approach to https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/docs/productionizing.md#openai-capacity with a link to this PR? You could contrast when someone might opt for this over ACA/APIM (presumably cost/complexity).

simonkurtz-MSFT commented 3 months ago

Hi @pamelafox, could I trouble you for another review of this PR, please? Thank you very much for all your help!