Azure / azure-openai-benchmark

Azure OpenAI benchmarking tool

Fix retry logic #21

Closed · michaeltremeer closed this 9 months ago

michaeltremeer commented 10 months ago

This PR builds upon and resolves #20. By default, when retry=none, the tool still retries requests for up to the hard-coded MAX_RETRY_SECONDS (5 seconds). In practice, this means that if a test workload pushes a PTU resource to the point of throttling, requests are still retried under the hood, resulting in TTFT and e2e latency values up to 5 seconds higher than they would be if no requests were actually retried (assuming the test runs long enough and with enough volume).

This is a major issue: if any throttling occurs when testing a PTU resource with the default retry option (retry=none), the output statistics will show inflated TTFT and e2e latency values, making PTU performance look worse than it is in reality. PayGO is unlikely to hit the same issue for most real-world workloads, since 429s there are only triggered by the number of API calls, and any throttling due to throughput simply results in a single, long-running request.

This PR adjusts the behaviour so that no retries occur when retry=none, increases MAX_RETRY_SECONDS to something more suitable for the occasions where you would set retry=exponential (e.g. real-world chat applications where no backup resource is available for excess traffic), and includes an explanation of when to use each retry option available. The README info is probably too detailed and complex, so feedback here would be great.
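For clarity, here is a minimal sketch of the intended behaviour after this change. The names retry, call_tries and MAX_RETRY_SECONDS come from the discussion above; everything else (the function shape, the backoff constants, the raised MAX_RETRY_SECONDS value) is illustrative rather than the tool's actual code.

```python
import asyncio
import random
import time

MAX_RETRY_SECONDS = 60  # raised from the previous hard-coded 5s; exact value illustrative

async def call_with_retry(make_request, retry: str = "none"):
    """Sketch of the intended retry semantics (not the tool's actual code).

    retry="none":        a 429 is recorded as throttled and returned immediately.
    retry="exponential": back off and retry until MAX_RETRY_SECONDS has elapsed.
    Returns (response, call_tries, throttled).
    """
    start = time.monotonic()
    call_tries = 0
    throttled = False
    while True:
        call_tries += 1
        response = await make_request()
        if response.status != 429:
            return response, call_tries, throttled
        throttled = True
        # The bug this PR fixes: previously the request was still retried for up
        # to MAX_RETRY_SECONDS even with retry="none", inflating TTFT/e2e latency.
        if retry == "none":
            return response, call_tries, throttled
        if time.monotonic() - start >= MAX_RETRY_SECONDS:
            return response, call_tries, throttled
        # exponential backoff with jitter between attempts
        await asyncio.sleep(min(2 ** (call_tries - 1), 16) * random.uniform(0.5, 1.5))
```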

technicianted commented 10 months ago

Retrying due to 429 is part of core functionality and is expected behavior. In a different offering such as PayGo, latency is implicitly incurred with service load and users have no control over it. With PTU-M, a 429 is a backpressure signal telling callers that latency will be violated and that it is "up to you to decide if you want to retry". So what the current implementation does makes sense.

This is also why we have a separate throttled counter that indicates if calls are being throttled.

I do agree with you that it may sound a little too implicit. But the solution should not be to consider retried requests as new requests because that's not what users will see in real life.

michaeltremeer commented 10 months ago

> Retrying due to 429 is part of core functionality and is expected behavior. In a different offering such as PayGo, latency is implicitly incurred with service load and users have no control over it. With PTU-M, a 429 is a backpressure signal telling callers that latency will be violated and that it is "up to you to decide if you want to retry". So what the current implementation does makes sense.

> This is also why we have a separate throttled counter that indicates if calls are being throttled.

This makes perfect sense when retry=expo, but the problem here is that even when retry=none, there is still retrying going on. If I am a user and I set retry=none, my expectation is that zero retries happen. This is not the case, which is why I am treating this as a bug. It also means that any time a benchmark is run with the default of retry=none and some amount of throttling occurs, the TTFT & e2e latency statistics will be artificially high for the PTU. Quite clearly this is problematic.

To illustrate this, I've run a few different benchmarks on a PTU resource and compared the results between the main branch and this PR's fix branch. I also added some code to dump the contents of the stats samples, allowing me to see the average number of call_tries for each request. Note that in some runs the PTU's leaky bucket was already full (because another test had run immediately beforehand), while in others it was empty, so the number of throttled requests varies somewhat between runs.
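For reference, an aggregation along these lines is enough to surface the issue. The call_tries field name comes from the discussion above; the file layout and the helper below are an illustrative sketch, not the exact code used for the tables.

```python
import json

def summarize_samples(path: str) -> None:
    """Illustrative aggregation over dumped per-request stat samples.

    Assumes one JSON object per line with a numeric `call_tries` field
    (name taken from the discussion above); the file layout is hypothetical.
    """
    with open(path) as f:
        samples = [json.loads(line) for line in f if line.strip()]
    avg_tries = sum(s["call_tries"] for s in samples) / len(samples)
    retried = sum(1 for s in samples if s["call_tries"] > 1)
    print(f"samples={len(samples)}  avg call_tries={avg_tries:.2f}  retried={retried}")
    # With retry=none, avg call_tries should be exactly 1; anything above 1 means
    # requests were silently retried and the reported TTFT/e2e latency is inflated.

summarize_samples("stats_samples.jsonl")  # hypothetical file name
```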

The args for the tests below:

| Run | retry | Number of throttled requests | TTFT avg / P95 | Average call_tries per sample |
|---|---|---|---|---|
| Main, MAX_RETRY_SECONDS=5 | none | 52 | {"avg": 1.111, "95th": 3.839} | 2.9 |
| Main, MAX_RETRY_SECONDS=5 | expo | 70 | {"avg": 1.392, "95th": 4.311} | 3.3 |
| Main, MAX_RETRY_SECONDS=30 | none | 4 | {"avg": 4.502, "95th": 16.009} | 7.5 |
| Main, MAX_RETRY_SECONDS=30 | expo | 5 | {"avg": 8.248, "95th": 19.398} | 9.7 |
| This bugfix branch | none | 36 | {"avg": 0.67, "95th": 0.729} | 1 |
| This bugfix branch | expo | 0 | {"avg": 9.339, "95th": 27.645} | 10.2 |

Some points to call out:

I am more than happy to share the raw data used to create the table above, but it seems pretty clear that this is indeed an issue and that we should patch it.

> I do agree with you that it may sound a little too implicit. But the solution should not be to consider retried requests as new requests because that's not what users will see in real life.

I agree with this logic, but only when retry=expo. The problem this PR addresses is not that case, though, but rather the behaviour when retry=none. I've included some examples in the README of when each option makes sense, which should help users decide when to use each one.