Azure / azure-openai-benchmark

Azure OpenAI benchmarking tool

Decoupling throttling retry logic & throttling backoff logic #34

Open sushaanttb opened 6 months ago

sushaanttb commented 6 months ago

As seen in the code, we have two separate retry logics for 429 errors:

(screenshot of the code in question)

File: `oairequester.py`, class: `OAIRequester`, function: `_call`

Currently, these two retry logics are coupled together.

This violates the separation-of-concerns principle. I am also not sure the end user would always want that, i.e. multiple retries of the same request from two different logics. For example, if the user specifies the backoff strategy, the same 429 request that was already retried based on RETRY_AFTER_MS_HEADER will be retried multiple times again!

Solution suggestion: it would be a good idea (just like backoff) to also parametrize the retry-header logic via command-line arguments, so that the end user can adjust the retry behavior to their requirements.
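For illustration, here is a minimal sketch of what this could look like (the `--retry-header` flag name and its wiring are hypothetical, not existing options of this tool):

```python
# Hypothetical sketch only: expose the retry-after-header behavior as its own
# CLI switch, decoupled from the backoff setting. Flag names are illustrative.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--retry", choices=["none", "exponential"], default="none",
                    help="backoff strategy for failed requests (existing option)")
parser.add_argument("--retry-header", action="store_true",
                    help="honor the retry-after-ms header on 429s "
                         "(hypothetical new option, independent of --retry)")
args = parser.parse_args()

# Each retry mechanism is then gated independently inside the requester:
def should_wait_on_retry_header(args) -> bool:
    return args.retry_header            # header-driven waits only if requested

def should_backoff(args) -> bool:
    return args.retry == "exponential"  # backoff retries only if requested
```

That way `retry=exponential` would no longer imply the header-based wait, and each behavior could be tested in isolation.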

michaeltremeer commented 6 months ago

I am copy-pasting my response from your other issue for anyone else who reads this issue:

Hey @sushaanttb,

Have a look at https://github.com/Azure/azure-openai-benchmark/pull/21 and the discussion within. You will see some more info there about what is going on with the different retry settings, and how MAX_RETRY_SECONDS affects the outputs. That PR has not been merged and it doesn't look like it will be, so if you want the behaviour you suggest, I would suggest you use my fork, which fixes this issue and uses a more suitable value of 60 seconds for MAX_RETRY_SECONDS.

sushaanttb commented 6 months ago

Hi @michaeltremeer, thanks for the comment!

Your PR was merged after I raised those issues, so I have looked at the latest code and the discussion you referred to, but I am not sure how that resolves this issue.

With your PR, the issue of not retrying requests at all when retry=none is fixed, which is good! However, when retry=exponential we are still retrying the same request with two different retry logics.

Should we not give the user control to decide that?

Attaching a screenshot of the latest code for reference:

(screenshot of the current retry code)

Basically, introduce another argument/parameter so the user can choose the "retry header" strategy explicitly, instead of it also being governed by the same "backoff" check.

michaeltremeer commented 6 months ago

Hey @sushaanttb, you're spot on that the behaviour here doesn't make sense, and if you follow the logic then the actual outcome for a throttled request on PTU-M will look something like this:

There's one other important idea to explore though. While in the real world we are going to see spiky usage across the endpoint (which is where the 429 behaviour of PTU-M can be of huge benefit), as of right now this load-testing tool delivers a consistent, linear call rate. Because it is linear, if the endpoint can only process 10 RPM but you are sending 15, you will always end up in a situation where the queue of new requests builds up until every client has an open request. At that point, all requests are retrying regularly, and with enough clients (which create a longer queue of waiting requests), almost all requests will end up hitting the 60+60 second timeout and failing.
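To make the arithmetic concrete, here's a toy simulation of that backlog growth (the 10 vs 15 RPM figures are just the assumed numbers from above, not measurements from the tool):

```python
# Toy illustration of the queue build-up described above: a linear load of
# 15 RPM against an endpoint that can only complete 10 RPM.
capacity_rpm = 10   # requests the endpoint can complete per minute (assumed)
arrival_rpm = 15    # requests the benchmark sends per minute (assumed)

backlog = 0
for minute in range(1, 6):
    backlog += arrival_rpm - capacity_rpm  # 5 extra requests queue up each minute
    # Each queued request waits roughly backlog/capacity minutes to be served,
    # so with a fixed 60+60s retry window, older requests start timing out.
    wait_minutes = backlog / capacity_rpm
    print(f"minute {minute}: backlog={backlog} requests, "
          f"approx wait={wait_minutes:.1f} min")
```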

For this reason the whole idea of having retries in this tool is a bit of a waste of time: either the rate is low enough that you aren't throttling and don't need to retry, or the load is too high and you are guaranteeing an ever-increasing queue of requests. This means the only effect of being able to manually set MAX_RETRY_SECONDS is that you are setting a ceiling on the maximum latency you will see on your requests. Your RPM and TPM won't change; you are just moving the ceiling on your PTU-M request latencies, making the results look worse due to longer latency on each request. In the real world, you would usually use the 429 responses in a couple of ways:

  1. For low-priority requests where latency doesn't matter, if they receive a single 429 then you will immediately divert them to a PayGO resource. This releases the PTU-M capacity for other higher priority requests, e.g. chat use cases.
  2. If all the remaining requests on the endpoint are high-priority requests but you're still throttling, you would still want to set a max retry limit so that if a request retries more than 2 times, you divert it to a PayGO resource. This helps avoid the problem of an ever-growing queue of requests as users continually bombard the PTU-M resource with more requests than it can handle. At the end of the day, the lower latency of PayGO is preferable to all of your users building up a long queue of requests and guaranteeing every user an unacceptable response latency.

In both of the situations above, there is a 'circuit-breaker' that can shed load to a backup resource when load exceeds the resource's capacity. The tool in its current state assumes this backup resource doesn't exist, which leads to unrealistic results that should never occur in production (if your architecture is designed correctly).
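As a rough sketch, such a circuit-breaker could look like this (the send functions, response shape, and thresholds are placeholders for illustration, not part of this tool):

```python
# Illustrative circuit-breaker: shed load from PTU-M to PayGO on throttling,
# following the two policies described above. All names here are placeholders.
import time

MAX_PTU_RETRIES = 2  # assumed threshold for high-priority requests

def wait_for_retry_after(response):
    # Sleep for the server-suggested interval (retry-after header), default 1s.
    time.sleep(float(response.headers.get("retry-after", 1)))

def send_with_fallback(request, send_to_ptu, send_to_paygo, high_priority):
    """Try the PTU-M deployment first; divert to PayGO when throttled."""
    attempts = 0
    while True:
        response = send_to_ptu(request)
        if response.status_code != 429:
            return response
        if not high_priority:
            return send_to_paygo(request)    # policy 1: divert immediately
        attempts += 1
        if attempts > MAX_PTU_RETRIES:
            return send_to_paygo(request)    # policy 2: divert after N retries
        wait_for_retry_after(response)       # otherwise wait and retry PTU-M
```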

On your various points across these issues:

sushaanttb commented 6 months ago

Thanks @michaeltremeer for the detailed comments!

I basically tested this tool on a non-PTU deployment (as I do not have a PTU-based deployment yet), purposefully building a scenario that hits the rate limits so I could see the throttling numbers. I wanted to understand the output in a bit more detail.

After your comment, I had another look at the PTU documentation, but I couldn't find any note on PTU-M & PTU-C specifically. For reference, these are the documentation links I referred to: Getting started with PTU & What is provisioned throughput?.

Also, when selecting the model deployment type, I only get the following option for PTU; it lists Provisioned-Managed as the only option.

(screenshot of the deployment-type selector, showing only Provisioned-Managed)

Hence it is difficult for me to comment on those points before actually knowing about them, but thanks for the details, and it would be helpful if you could share any reference link on PTU-M and PTU-C.

However, I agree with the thought in your next comment that the retry logic doesn't seem very helpful, as it just creates an ever-increasing queue of requests to be retried: once a 429 error is received, even if we park the current request to retry after some time, the other requests still keep going to the service, without respecting the fact that the service has already hit its limit, and they then go through the same retry logic. Hence I raised this issue of decoupling the two retry logics and making them configurable, so that at least the control is transferred to the user to decide whether they want to test with any retry strategy at all, with no hidden/additional retries happening under the hood.

Also, thanks for clarifying the reason for keeping MAX_RETRY_SECONDS hardcoded. Could you please post the same explanation in Issue36, which I raised for this, so that anyone looking at that issue can see the reason?

Lastly, I also had a question about the README. It says this tool is meant for benchmarking PTU-based deployments, but I guess the tool can be used for normal non-PTU deployments as well, right? The only additional metric we get for a PTU-based deployment is the "azure-openai-deployment-utilization" header; everything else seems generic. The "retry-after" header should come back for normal deployments as well (I haven't tested it, but my hunch is that any good service like Azure OpenAI would at least send the standard "retry-after" header with 429 errors). Thoughts?
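For anyone who wants to check this, here is a quick sketch of how one might inspect those headers on a non-PTU deployment (the endpoint, deployment name, and API version are placeholders; I haven't confirmed which headers actually come back):

```python
# Minimal sketch (not part of the benchmark tool) to see which headers a
# 429 response carries on a non-PTU deployment. Substitute your own values.
import os
import requests

endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]      # e.g. https://myresource.openai.azure.com
deployment = os.environ["AZURE_OPENAI_DEPLOYMENT"]  # a non-PTU (PayGO) deployment
api_key = os.environ["AZURE_OPENAI_KEY"]

url = f"{endpoint}/openai/deployments/{deployment}/chat/completions?api-version=2024-02-01"
body = {"messages": [{"role": "user", "content": "ping"}], "max_tokens": 1}

resp = requests.post(url, json=body, headers={"api-key": api_key})
if resp.status_code == 429:
    # Standard header; my hunch is it is present on non-PTU deployments too.
    print("retry-after:", resp.headers.get("retry-after"))
    # Millisecond-granularity variant the benchmark tool looks for.
    print("retry-after-ms:", resp.headers.get("retry-after-ms"))
    # Only expected from provisioned (PTU) deployments.
    print("utilization:", resp.headers.get("azure-openai-deployment-utilization"))
```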