Closed michaeltremeer closed 9 months ago
Retrying due to 429 is part of core functionality and is expected behavior. In different offering such as PayGo, latency is implicitly incurred with service load and users have no control over it. With PTU-M, 429 is a backpressure signal telling callers that latency will be violated and it is "up to you to decide if you want to retry". So it makes sense what current implementation does.
This is also why we have a separate throttled
counter that indicates if calls are being throttled.
I do agree with you that it may sound a little too implicit. But the solution should not be to consider retried requests as new requests because that's not what users will see in real life.
Retrying due to 429 is part of core functionality and is expected behavior. In different offering such as PayGo, latency is implicitly incurred with service load and users have no control over it. With PTU-M, 429 is a backpressure signal telling callers that latency will be violated and it is "up to you to decide if you want to retry". So it makes sense what current implementation does.
This is also why we have a separate
throttled
counter that indicates if calls are being throttled.
This makes perfect sense for when retry=expo
, but the problem here is that even when retry=none
, there is still retrying going on. if I am a user and set retry=none
, my expectation is that the number of retries happening is zero. This is not the case, hence why this is treated as a bug. It also means that any time a benchmark is run with the default of retry=none
and where some amount of throttling is occurring, the TTFT & e2e latency statistics are going to be artificially high for the PTU. Quite clearly this is problematic.
To illustrate this, I've run a few different benchmarks on a PTU resource and compared the results between the main branch and this PR fix branch. I also added some code to dump the contents of the stats samples, allowing me to see the average number of call_tries
for each request. Note that in some examples, the leaky bucket of the PTU was already full (due to a test having occurred immediately before the test), and in others the leaky bucket is empty, so the number of throttled requests can be a little high or low between runs.
The args for the tests below:
Run | retry |
Number of Throttled Requests | TTFT Avg / P95 | Average call_tries per Sample |
---|---|---|---|---|
Main, MAX_RETRY_SECONDS=5 | none | 52 | {"avg": 1.111, "95th": 3.839} | 2.9 |
Main, MAX_RETRY_SECONDS=5 | expo | 70 | {"avg": 1.392, "95th": 4.311} | 3.3 |
Main, MAX_RETRY_SECONDS=30 | none | 4 | {"avg": 4.502, "95th": 16.009} | 7.5 |
Main, MAX_RETRY_SECONDS=30 | expo | 5 | {"avg": 8.248, "95th": 19.398} | 9.7 |
This bugfix branch | none | 36 | {"avg": 0.67, "95th": 0.729} | 1 |
This bugfix branch | expo | 0 | {"avg": 9.339, "95th": 27.645} | 10.2 |
Some points to call out:
retry
, we still have multiple calls occurring in the main branch, which is exactly the opposite of what should be happening when retry=none
.retry=none
rows in the main branch are much higher than in this bugfix branch, because multiple retries are occurring.MAX_RETRY_SECONDS
to 30 seconds instead of 5, we see that the TTFT times for the main branch all start to approach 30 seconds when retry=none
, and the number of call_tries
per request is very high.retry
options. When retry=none
, the count of throttled requests is high, but we only have a single API call per request and our TTFT times reflect what we expect for that single API call. When retry=expo
, we see the number of call_tries
increases, as do the TTFT times for each request (since it is being retried an average of 10.2 times). The number of throttled requests also decreases to 0 since all requests are completed during the retry period.I am more than happy to share the raw data used to create the above tables, but it seems pretty clear that this is indeed an issue and that we should patch this.
I do agree with you that it may sound a little too implicit. But the solution should not be to consider retried requests as new requests because that's not what users will see in real life.
I agree with this logic, but only when retry=expo
. The problem in this PR is not for this case though, but rather to fix things for when retry=none
. I've included some examples of when each option makes sense in the README which should help the user differentiate when to use each example.
This PR builds upon and resolves #20. By default, when
retry=none
, the tool will still retry requests up to the hard-codedMAX_RETRY_SECONDS
variable (5 seconds). In practice, this means that if a test workload pushes a PTU resource to the point of throttling, requests will still be retried under the hood, resulting in TTFT and e2e latency values that are up to 5 seconds higher than they would be if there truly were no requests being retried (assuming the test is run long enough and with enough volume).This is a major issue, since if any throttling occurs when testing a PTU resource and with the default retry option (
retry=none
), the output statistics will show inflated TTFT and e2e latency values, making the PTU performance look worse than in reality. PayGO is not likely to experience the same issue for most real-world workloads, since 429s only occur there due to # of API calls and any throttling due to throughput just results in a single, long-running request.This PR adjusts the behaviour so that no retries occur when
retry=none
, increasesMAX_RETRY_SECONDS
to something more suitable for the occasions where you would setretry=exponential
(e.g. real-world chat applications where no backup resource is available for excess traffic), and includes an explanation of when to use eachretry
option available. The README info is probably too detailed and complex, so feedback here would be great.