Azure / azure-openai-benchmark

Azure OpenAI benchmarking tool

Fix retry logic #21

Closed · michaeltremeer closed this 9 months ago

michaeltremeer commented 10 months ago

This PR builds upon and resolves #20. By default, when retry=none, the tool still retries requests for up to the hard-coded MAX_RETRY_SECONDS (5 seconds). In practice, this means that if a test workload pushes a PTU resource to the point of throttling, requests are still retried under the hood, resulting in TTFT and e2e latency values up to 5 seconds higher than they would be if no requests were actually retried (assuming the test runs long enough and with enough volume).

This is a major issue: if any throttling occurs when testing a PTU resource with the default retry option (retry=none), the output statistics will show inflated TTFT and e2e latency values, making PTU performance look worse than it is in reality. PayGO is unlikely to hit the same issue for most real-world workloads, since 429s there are only triggered by the number of API calls, and any throttling due to throughput simply results in a single, long-running request.

This PR adjusts the behaviour so that no retries occur when retry=none, increases MAX_RETRY_SECONDS to something more suitable for the occasions where you would set retry=exponential (e.g. real-world chat applications where no backup resource is available for excess traffic), and includes an explanation of when to use each retry option available. The README info is probably too detailed and complex, so feedback here would be great.
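For clarity, here is a minimal sketch of the intended behaviour after this change. The names retry, call_tries and MAX_RETRY_SECONDS come from the discussion above; everything else (the function shape, the backoff constants, the raised MAX_RETRY_SECONDS value) is illustrative rather than the tool's actual code.

```python
import asyncio
import random
import time

MAX_RETRY_SECONDS = 60  # raised from the previous hard-coded 5s; exact value illustrative

async def call_with_retry(make_request, retry: str = "none"):
    """Sketch of the intended retry semantics (not the tool's actual code).

    retry="none":        a 429 is recorded as throttled and returned immediately.
    retry="exponential": back off and retry until MAX_RETRY_SECONDS has elapsed.
    Returns (response, call_tries, throttled).
    """
    start = time.monotonic()
    call_tries = 0
    throttled = False
    while True:
        call_tries += 1
        response = await make_request()
        if response.status != 429:
            return response, call_tries, throttled
        throttled = True
        # The bug this PR fixes: previously the request was still retried for up
        # to MAX_RETRY_SECONDS even with retry="none", inflating TTFT/e2e latency.
        if retry == "none":
            return response, call_tries, throttled
        if time.monotonic() - start >= MAX_RETRY_SECONDS:
            return response, call_tries, throttled
        # exponential backoff with jitter between attempts
        await asyncio.sleep(min(2 ** (call_tries - 1), 16) * random.uniform(0.5, 1.5))
```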

technicianted commented 10 months ago

Retrying due to 429 is part of core functionality and is expected behavior. In a different offering such as PayGo, latency is implicitly incurred with service load and users have no control over it. With PTU-M, a 429 is a backpressure signal telling callers that latency will be violated and that it is "up to you to decide if you want to retry". So what the current implementation does makes sense.

This is also why we have a separate throttled counter that indicates if calls are being throttled.

I do agree with you that it may sound a little too implicit. But the solution should not be to consider retried requests as new requests because that's not what users will see in real life.

michaeltremeer commented 10 months ago

> Retrying due to 429 is part of core functionality and is expected behavior. In a different offering such as PayGo, latency is implicitly incurred with service load and users have no control over it. With PTU-M, a 429 is a backpressure signal telling callers that latency will be violated and that it is "up to you to decide if you want to retry". So what the current implementation does makes sense.

> This is also why we have a separate throttled counter that indicates if calls are being throttled.

This makes perfect sense when retry=expo, but the problem here is that even when retry=none, there is still retrying going on. If I am a user and I set retry=none, my expectation is that zero retries happen. This is not the case, which is why I am treating this as a bug. It also means that any time a benchmark is run with the default of retry=none and some amount of throttling occurs, the TTFT & e2e latency statistics will be artificially high for the PTU. Quite clearly this is problematic.

To illustrate this, I've run a few different benchmarks on a PTU resource and compared the results between the main branch and this PR's fix branch. I also added some code to dump the contents of the stats samples, allowing me to see the average number of call_tries for each request. Note that in some runs the PTU's leaky bucket was already full (because another test had run immediately beforehand), while in others it was empty, so the number of throttled requests varies somewhat between runs.
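For reference, an aggregation along these lines is enough to surface the issue. The call_tries field name comes from the discussion above; the file layout and the helper below are an illustrative sketch, not the exact code used for the tables.

```python
import json

def summarize_samples(path: str) -> None:
    """Illustrative aggregation over dumped per-request stat samples.

    Assumes one JSON object per line with a numeric `call_tries` field
    (name taken from the discussion above); the file layout is hypothetical.
    """
    with open(path) as f:
        samples = [json.loads(line) for line in f if line.strip()]
    avg_tries = sum(s["call_tries"] for s in samples) / len(samples)
    retried = sum(1 for s in samples if s["call_tries"] > 1)
    print(f"samples={len(samples)}  avg call_tries={avg_tries:.2f}  retried={retried}")
    # With retry=none, avg call_tries should be exactly 1; anything above 1 means
    # requests were silently retried and the reported TTFT/e2e latency is inflated.

summarize_samples("stats_samples.jsonl")  # hypothetical file name
```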

The args for the tests below:

| Run | retry | Number of throttled requests | TTFT avg / P95 | Average call_tries per sample |
|---|---|---|---|---|
| Main, MAX_RETRY_SECONDS=5 | none | 52 | {"avg": 1.111, "95th": 3.839} | 2.9 |
| Main, MAX_RETRY_SECONDS=5 | expo | 70 | {"avg": 1.392, "95th": 4.311} | 3.3 |
| Main, MAX_RETRY_SECONDS=30 | none | 4 | {"avg": 4.502, "95th": 16.009} | 7.5 |
| Main, MAX_RETRY_SECONDS=30 | expo | 5 | {"avg": 8.248, "95th": 19.398} | 9.7 |
| This bugfix branch | none | 36 | {"avg": 0.67, "95th": 0.729} | 1 |
| This bugfix branch | expo | 0 | {"avg": 9.339, "95th": 27.645} | 10.2 |

Some points to call out:

I am more than happy to share the raw data used to create the table above, but it seems pretty clear that this is indeed an issue and that we should patch it.

> I do agree with you that it may sound a little too implicit. But the solution should not be to consider retried requests as new requests because that's not what users will see in real life.

I agree with this logic, but only when retry=expo. The problem this PR addresses is not that case, though, but rather the behaviour when retry=none. I've included some examples in the README of when each option makes sense, which should help users decide when to use each one.