sushaanttb opened this issue 6 months ago
I am copy-pasting my response from your other issue for anyone else who reads this issue:
Hey @sushaanttb,
Have a look at https://github.com/Azure/azure-openai-benchmark/pull/21 and the discussion within. You will see some more info there about what is going on with the different retry settings, and how MAX_RETRY_SECONDS affects the outputs. That PR has not been merged and it doesn't look like it will be, so if you want the behaviour you describe, I would suggest using my fork, which fixes this issue and uses a more suitable value of 60 seconds for MAX_RETRY_SECONDS.
Hi @michaeltremeer, thanks for the comment!
Your PR has now been merged, after I raised those issues. Looking at the latest code and the discussion you referred to, however, I am not sure how it resolves this issue.
With your PR, the issue of not retrying requests at all when retry=none is fixed, which is good!
However, when retry=exponential, we are still retrying the request with two different retry logics. Should we not give the user control to decide?
Attaching a screenshot for reference, as per the latest code:
Basically, I am suggesting introducing another argument/parameter so the user can choose to retry using the "retry header" strategy, instead of governing it too from the same "backoff" check.
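As a rough sketch of what that could look like (the flag name --retry-strategy and its choices are hypothetical, not part of the tool's actual CLI):

```python
import argparse

# Hypothetical sketch: expose the retry behaviour as an explicit CLI choice
# instead of deriving it from the single "backoff" check.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--retry-strategy",
    choices=["none", "header", "exponential"],
    default="none",
    help="none: fail on 429; header: honour the retry-after header only; "
         "exponential: use exponential backoff only",
)

args = parser.parse_args(["--retry-strategy", "header"])
print(args.retry_strategy)  # -> header
```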
Hey @sushaanttb, you're spot on that the behaviour here doesn't make sense, and if you follow the logic then the actual outcome for a throttled request on PTU-M will look something like this:

1. The request is throttled and retried according to the retry-after header until MAX_RETRY_SECONDS is reached, which is 60 seconds by default. The next call will then fail (since it has exceeded 60 seconds), raise for status, and then the exponential backoff decorator is triggered.
2. The backoff decorator retries the request, which re-enters the if self.backoff... loop, similar to the first request. When this hits the 60 seconds timer, it fails and raises for status, and now the backoff decorator does not retry it, since it is beyond the 60 retry second limit.
3. The net result is that throttled requests are effectively governed by the retry-after header rather than the exponential backoff schedule.

There's one other important idea to explore though. While in the real world we are going to see spikey usage across the endpoint (which is where the 429 behaviour of PTU-M can be of huge benefit), as of right now this load testing tool delivers a consistent, linear call rate. Because it is linear, if the endpoint can only process 10 RPM but you are sending 15, you will always end up in a situation where the queue of new requests builds up until every client has an open request. In this case, it means that all requests are now going to be retrying regularly, and with enough clients (which create a longer queue of waiting requests), almost all requests will end up hitting the 60+60 seconds timeout and failing.
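The queue-buildup argument can be made concrete with a toy calculation (the 10 vs 15 RPM figures come from the example above; the model itself is illustrative only, not the tool's behaviour):

```python
# Toy model: a constant 15 RPM offered against a 10 RPM capacity.
# The surplus 5 requests/minute join a backlog that is never drained,
# so the wait for the newest request grows without bound.
capacity_rpm = 10
offered_rpm = 15

backlog = 0
for minute in range(1, 6):
    backlog += offered_rpm - capacity_rpm
    wait_minutes = backlog / capacity_rpm  # rough wait before service
    print(f"minute {minute}: backlog={backlog}, approx wait={wait_minutes:.1f} min")
# By minute 5 the newest request already waits ~2.5 minutes, beyond a
# 60+60 second retry budget, so it is destined to time out and fail.
```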
For this reason, the whole idea of having retries in this tool is a bit of a waste of time: either the rate is low enough that you aren't throttling and don't need to retry, or the load is too high and you are guaranteeing an ever-increasing queue of requests. This means the only benefit of being able to manually set MAX_RETRY_SECONDS is that you are effectively setting a ceiling on the maximum latency you will see on your requests. Your RPM and TPM won't change; you are just moving the maximum ceiling on your PTU-M request latencies, making them look worse due to longer latency on each request. In the real world, you would usually use the 429 responses in a couple of ways:
In both of the situations above, there is a 'circuit-breaker' that can shed load to a backup resource when load exceeds the resource capacity. The tool in its current state assumes this backup resource doesn't exist, which leads to unrealistic results that should never occur in production (if your architecture is designed correctly).
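For illustration, the circuit-breaker/spillover idea could be sketched like this (the function and endpoint names are made up for the example; send stands in for a real HTTP call):

```python
# Hypothetical spillover pattern: on a 429 from the primary (e.g. PTU)
# endpoint, shed the request to a backup (e.g. pay-as-you-go) resource
# instead of queueing retries against an already-saturated deployment.
def call_with_spillover(send, primary, backup):
    status, body = send(primary)
    if status == 429:
        # Primary is saturated: break the circuit and use the backup.
        status, body = send(backup)
    return status, body

# Stubbed transport for demonstration: the primary always throttles.
def fake_send(endpoint):
    if endpoint == "primary":
        return 429, None
    return 200, "ok from backup"

result = call_with_spillover(fake_send, "primary", "backup")
print(result)  # -> (200, 'ok from backup')
```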
On your various points across these issues:

On making MAX_RETRY_SECONDS configurable: getting the dynamic input into the decorator (which needs to happen before the class is instantiated) requires some extra work/function wrappers. Because of this, I ended up not changing it, as I wanted to minimise the complexity and number of changes in my PR to maximise the chance of it getting accepted. It's definitely a possible idea, but given the points discussed above, I think it is kind of pointless (in my personal opinion).

Thanks @michaeltremeer for the detailed comments!
I basically tested this tool on a non-PTU deployment (as I do not have a PTU-based deployment yet), purposefully building a scenario that hits the rate limits so I could see the throttling numbers. I wanted to understand the output in a bit more detail.
Now, after your comment, I had another look at the PTU documentation, but I couldn't find any note on PTU-M and PTU-C specifically. For reference, these are the documentation links I referred to: Getting started with PTU & What is provisioned throughput?.
Also, when selecting the model deployment type, I only get the following option for PTU: it lists Provisioned-Managed as the only option.
Hence, it is difficult for me to comment on those points without knowing more about them, but thanks for the details. It would be helpful if you could share any reference link on PTU-M and PTU-C.
However, I agree with the thought in your next comment that the retry logic doesn't seem very helpful, as it just creates an ever-increasing queue of requests to be retried: once a 429 error is received, even if we park the current request to retry after some time, the other requests still keep going to the service, without respecting the fact that the service has already hit its limit, and they then go through the same retry logic. Hence, I raised this issue of decoupling the two retry logics and making them configurable, so that control is at least transferred to the user to decide whether they want to test the tool with any retry strategy at all, with no hidden/additional retries happening under the hood.
Subsequently, thanks for clarifying the reason for keeping MAX_RETRY_SECONDS hardcoded. Could you please also reply with the same in Issue36, which I raised for the same reason, so that anyone looking at that issue can see the explanation?
Lastly, I also had a question about the README. The README says this tool is meant for benchmarking PTU-based deployments, but I guess it can be used for normal non-PTU deployments as well, right? The only additional metric we get with a PTU-based deployment is the azure-openai-deployment-utilization header; everything else seems generic. The retry-after header should come back for normal deployments as well (I haven't tested it, but my hunch so far is that any good service like Azure OpenAI would at least send the standard retry-after header on 429 errors). Thoughts?
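To illustrate the hunch above, here is a tiny sketch of the difference between the two deployment types (the headers dict is a stand-in for a real HTTP response's headers; the header names are the ones already mentioned in this thread):

```python
# Stand-in for the headers of a throttled HTTP response from the service.
headers = {
    "retry-after": "7",  # standard throttling header, any deployment type
    "azure-openai-deployment-utilization": "98.5%",  # PTU-specific metric
}

# retry-after is expected for any deployment; the utilization header
# would simply be absent (None here) on a non-PTU deployment.
retry_after_s = int(headers.get("retry-after", 0))
utilization = headers.get("azure-openai-deployment-utilization")

print(retry_after_s, utilization)  # -> 7 98.5%
```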
Looking at the code, we have two retry logics for retrying 429 errors:
File: oairequester.py, Class: OAIRequester, Function: _call
Currently, these two retry logics are coupled together.
This violates the Separation of Concerns principle. Also, I am not sure the end user would always want that, i.e. multiple retries of the same request from two different logics. For example, if the backoff strategy is specified by the user, then the same 429 request which was already retried based on RETRY_AFTER_MS_HEADER will be retried multiple times again!
Solution suggestion: it would be a good idea (just like backoff) to also parametrize the retry logic via command-line arguments, so that the end user can adjust the retry behaviour to their requirements.
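One way to implement this while also addressing the earlier point about getting a dynamic value into the decorator (which is applied at class-definition time) is a decorator factory, i.e. a function that builds the decorator from a runtime value. A minimal sketch with illustrative names, not the tool's actual API:

```python
import time

def retrying(max_retry_seconds, initial_delay=0.1):
    """Decorator factory: builds a retry decorator from runtime values."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            deadline = time.monotonic() + max_retry_seconds
            delay = initial_delay
            while True:
                try:
                    return func(*args, **kwargs)
                except RuntimeError:
                    if time.monotonic() + delay > deadline:
                        raise  # retry budget exhausted: surface the error
                    time.sleep(delay)
                    delay *= 2  # exponential backoff
        return wrapper
    return decorator

# Because retrying() is called with a value, max_retry_seconds can come
# from a command-line argument instead of a hardcoded module constant.
attempts = {"n": 0}

@retrying(max_retry_seconds=2)
def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("simulated 429")
    return "ok"

print(flaky_call())  # -> ok (after two simulated 429s)
```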