Azure / azure-openai-benchmark

Azure OpenAI benchmarking tool

Decoupling throttling retry logic & throttling backoff logic #34

Open sushaanttb opened 6 months ago

sushaanttb commented 6 months ago

As seen in the code, we have two separate retry logics for 429 errors:

(screenshot of the code in question)

File: `oairequester.py`, class: `OAIRequester`, function: `_call`

Currently, these two retry logics are coupled together.

This violates the separation-of-concerns principle. I am also not sure the end user would always want that, i.e. multiple retries of the same request from two different logics. For example, if the user specifies the backoff strategy, the same 429 request that was already retried based on RETRY_AFTER_MS_HEADER will be retried multiple times again!

Solution suggestion: it would be a good idea (just like backoff) to also parametrize the retry-header logic via command-line arguments, so that the end user can adjust the retry behavior to their requirements.
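For illustration, here is a minimal sketch of what this could look like (the `--retry-header` flag name and its wiring are hypothetical, not existing options of this tool):

```python
# Hypothetical sketch only: expose the retry-after-header behavior as its own
# CLI switch, decoupled from the backoff setting. Flag names are illustrative.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--retry", choices=["none", "exponential"], default="none",
                    help="backoff strategy for failed requests (existing option)")
parser.add_argument("--retry-header", action="store_true",
                    help="honor the retry-after-ms header on 429s "
                         "(hypothetical new option, independent of --retry)")
args = parser.parse_args()

# Each retry mechanism is then gated independently inside the requester:
def should_wait_on_retry_header(args) -> bool:
    return args.retry_header            # header-driven waits only if requested

def should_backoff(args) -> bool:
    return args.retry == "exponential"  # backoff retries only if requested
```

That way `retry=exponential` would no longer imply the header-based wait, and each behavior could be tested in isolation.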

michaeltremeer commented 6 months ago

I am copy-pasting my response from your other issue for anyone else who reads this issue:

Hey @sushaanttb,

Have a look at https://github.com/Azure/azure-openai-benchmark/pull/21 and the discussion within. You will see some more info there about what is going on with the different retry settings, and how MAX_RETRY_SECONDS affects the outputs. That PR has not been merged and it doesn't look like it will be, so if you want the behaviour you suggest, I would suggest you use my fork, which fixes this issue and uses a more suitable value of 60 seconds for MAX_RETRY_SECONDS.

sushaanttb commented 6 months ago

Hi @michaeltremeer, thanks for the comment!

Your PR was merged after I raised those issues, so I have looked at the latest code and the discussion you referred to, but I am not sure how that resolves this issue.

With your PR, the issue of not retrying requests at all when retry=none is fixed, which is good! However, when retry=exponential we are still retrying the same request with two different retry logics.

Should we not give the user control to decide that?

Attaching a screenshot of the latest code for reference:

(screenshot of the current retry code)

Basically, introduce another argument/parameter so the user can choose the "retry header" strategy explicitly, instead of it also being governed by the same "backoff" check.

michaeltremeer commented 6 months ago

Hey @sushaanttb, you're spot on that the behaviour here doesn't make sense, and if you follow the logic then the actual outcome for a throttled request on PTU-M will look something like this:

There's one other important idea to explore though. While in the real world we are going to see spiky usage across the endpoint (which is where the 429 behaviour of PTU-M can be of huge benefit), as of right now this load-testing tool delivers a consistent, linear call rate. Because it is linear, if the endpoint can only process 10 RPM but you are sending 15, you will always end up in a situation where the queue of new requests builds up until every client has an open request. At that point, all requests are retrying regularly, and with enough clients (which create a longer queue of waiting requests), almost all requests will end up hitting the 60+60 second timeout and failing.
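To make the arithmetic concrete, here's a toy simulation of that backlog growth (the 10 vs 15 RPM figures are just the assumed numbers from above, not measurements from the tool):

```python
# Toy illustration of the queue build-up described above: a linear load of
# 15 RPM against an endpoint that can only complete 10 RPM.
capacity_rpm = 10   # requests the endpoint can complete per minute (assumed)
arrival_rpm = 15    # requests the benchmark sends per minute (assumed)

backlog = 0
for minute in range(1, 6):
    backlog += arrival_rpm - capacity_rpm  # 5 extra requests queue up each minute
    # Each queued request waits roughly backlog/capacity minutes to be served,
    # so with a fixed 60+60s retry window, older requests start timing out.
    wait_minutes = backlog / capacity_rpm
    print(f"minute {minute}: backlog={backlog} requests, "
          f"approx wait={wait_minutes:.1f} min")
```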

For this reason the whole idea of having retries in this tool is a bit of a waste of time: either the rate is low enough that you aren't throttling and don't need to retry, or the load is too high and you are guaranteeing an ever-increasing queue of requests. This means the only effect of being able to manually set MAX_RETRY_SECONDS is that you are setting a ceiling on the maximum latency you will see on your requests. Your RPM and TPM won't change; you are just moving the ceiling on your PTU-M request latencies, making the results look worse due to longer latency on each request. In the real world, you would usually use the 429 responses in a couple of ways:

  1. For low-priority requests where latency doesn't matter, if they receive a single 429 then you will immediately divert them to a PayGO resource. This releases the PTU-M capacity for other higher priority requests, e.g. chat use cases.
  2. If all the remaining requests on the endpoint are high-priority requests but you're still throttling, you would still want to set a max retry limit so that if a request retries more than 2 times, you divert it to a PayGO resource. This helps avoid the problem of an ever-growing queue of requests as users continually bombard the PTU-M resource with more requests than it can handle. At the end of the day, the lower latency of PayGO is preferable to all of your users building up a long queue of requests and guaranteeing every user an unacceptable response latency.

In both of the situations above, there is a 'circuit-breaker' that can shed load to a backup resource when load exceeds the resource's capacity. The tool in its current state assumes this backup resource doesn't exist, which leads to unrealistic results that should never occur in production (if your architecture is designed correctly).
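As a rough sketch, such a circuit-breaker could look like this (the send functions, response shape, and thresholds are placeholders for illustration, not part of this tool):

```python
# Illustrative circuit-breaker: shed load from PTU-M to PayGO on throttling,
# following the two policies described above. All names here are placeholders.
import time

MAX_PTU_RETRIES = 2  # assumed threshold for high-priority requests

def wait_for_retry_after(response):
    # Sleep for the server-suggested interval (retry-after header), default 1s.
    time.sleep(float(response.headers.get("retry-after", 1)))

def send_with_fallback(request, send_to_ptu, send_to_paygo, high_priority):
    """Try the PTU-M deployment first; divert to PayGO when throttled."""
    attempts = 0
    while True:
        response = send_to_ptu(request)
        if response.status_code != 429:
            return response
        if not high_priority:
            return send_to_paygo(request)    # policy 1: divert immediately
        attempts += 1
        if attempts > MAX_PTU_RETRIES:
            return send_to_paygo(request)    # policy 2: divert after N retries
        wait_for_retry_after(response)       # otherwise wait and retry PTU-M
```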

On your various points across these issues:

sushaanttb commented 6 months ago

Thanks @michaeltremeer for the detailed comments!

I basically tested this tool on a non-PTU deployment (as I do not have a PTU-based deployment yet), purposefully building a scenario that hits the rate limits so I could see the throttling numbers. I wanted to understand the output in a bit more detail.

After your comment, I had another look at the PTU documentation, but I couldn't find any note on PTU-M & PTU-C specifically. For reference, these are the documentation links I referred to: Getting started with PTU & What is provisioned throughput?.

Also, when selecting the model deployment type, I only get the following option for PTU; it lists Provisioned-Managed as the only option.

(screenshot of the deployment-type selector, showing only Provisioned-Managed)

Hence it is difficult for me to comment on those points before actually knowing about them, but thanks for the details, and it would be helpful if you could share any reference link on PTU-M and PTU-C.

However, I agree with the thought in your next comment that the retry logic doesn't seem very helpful, as it just creates an ever-increasing queue of requests to be retried: once a 429 error is received, even if we park the current request to retry after some time, the other requests still keep going to the service, without respecting the fact that the service has already hit its limit, and they then go through the same retry logic. Hence I raised this issue of decoupling the two retry logics and making them configurable, so that at least the control is transferred to the user to decide whether they want to test with any retry strategy at all, with no hidden/additional retries happening under the hood.

Also, thanks for clarifying the reason for keeping MAX_RETRY_SECONDS hardcoded. Could you please post the same explanation in Issue36, which I raised for this, so that anyone looking at that issue can see the reason?

Lastly, I also had a question about the README. It says this tool is meant for benchmarking PTU-based deployments, but I guess the tool can be used for normal non-PTU deployments as well, right? The only additional metric we get for a PTU-based deployment is the "azure-openai-deployment-utilization" header; everything else seems generic. The "retry-after" header should come back for normal deployments as well (I haven't tested it, but my hunch is that any good service like Azure OpenAI would at least send the standard "retry-after" header with 429 errors). Thoughts?
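For anyone who wants to check this, here is a quick sketch of how one might inspect those headers on a non-PTU deployment (the endpoint, deployment name, and API version are placeholders; I haven't confirmed which headers actually come back):

```python
# Minimal sketch (not part of the benchmark tool) to see which headers a
# 429 response carries on a non-PTU deployment. Substitute your own values.
import os
import requests

endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]      # e.g. https://myresource.openai.azure.com
deployment = os.environ["AZURE_OPENAI_DEPLOYMENT"]  # a non-PTU (PayGO) deployment
api_key = os.environ["AZURE_OPENAI_KEY"]

url = f"{endpoint}/openai/deployments/{deployment}/chat/completions?api-version=2024-02-01"
body = {"messages": [{"role": "user", "content": "ping"}], "max_tokens": 1}

resp = requests.post(url, json=body, headers={"api-key": api_key})
if resp.status_code == 429:
    # Standard header; my hunch is it is present on non-PTU deployments too.
    print("retry-after:", resp.headers.get("retry-after"))
    # Millisecond-granularity variant the benchmark tool looks for.
    print("retry-after-ms:", resp.headers.get("retry-after-ms"))
    # Only expected from provisioned (PTU) deployments.
    print("utilization:", resp.headers.get("azure-openai-deployment-utilization"))
```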