trueb2 closed this issue 4 months ago
I looked into this a little more. If too many attempts are made, then the backoff will be f64::MAX from
That f64::MAX is then used immediately, which causes the panic. The workaround for now is to never specify a combination of max attempts and initial backoff that could produce a calculated backoff of f64::MAX.
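To illustrate the failure mode described above, here is a minimal sketch using only the standard library. The function names and the `f64::MAX` fallback are assumptions made for illustration and are not copied from the SDK source; the only claim relied on is that `Duration::from_secs_f64` panics for values that are too large.

```rust
use std::panic;
use std::time::Duration;

/// Uncapped exponential backoff in seconds: initial * 2^attempt.
/// Hypothetical helper: falls back to f64::MAX when 2^attempt overflows u32.
fn uncapped_backoff_secs(initial_backoff_secs: f64, attempt: u32) -> f64 {
    2_u32
        .checked_pow(attempt)
        .map(|factor| f64::from(factor) * initial_backoff_secs)
        .unwrap_or(f64::MAX)
}

/// Returns true when the panicking constructor panics for `secs`.
/// The default panic hook still prints the panic message to stderr.
fn from_secs_f64_panics(secs: f64) -> bool {
    panic::catch_unwind(move || Duration::from_secs_f64(secs)).is_err()
}
```

With 100 attempts, `2_u32.checked_pow(100)` overflows, the sketch yields `f64::MAX`, and passing that value to `Duration::from_secs_f64` panics, while small attempt counts convert normally.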
Thank you for filing this issue.
Hi @trueb2, we're working on a fix to address this panic issue. In the meantime, we have a question for your use case.
In general, specifying the number of retry attempts to be 100 does not seem to be a normal workflow. What is the motivation behind this number? Is that out of necessity or are you simply testing the SDK's behavior to see what happens with the number of attempts being 100?
The use case for this is that there can be network failures or outages that are random but should eventually succeed. So the initial retry duration should be very low and many reattempts should be possible. Choosing appropriate maximum input values that do not induce a panic is opaque, so 100 seems a reasonable number of max attempts given that the max backoff is an input and the total operation timeout can be bounded by an absolute maximum duration. Exponential backoff does not make sense after a certain number of backoffs, but the other parameters exposed by the SDK should make it possible to avoid panicking or unreasonable delays.
I did find that the high retry attempt limits were out of necessity as I have hit the maximum backoff limit causing this panic multiple times over the last month. I was not testing the SDK's behavior.
Thank you for your response. https://github.com/smithy-lang/smithy-rs/pull/3621 will fix a panic in a way that it should still allow you to retry the desired number of times.
"Exponential backoff does not make sense after a certain number of backoffs,"
Correct, once an exponential backoff duration becomes large enough to exceed MAX_BACKOFF (by default 20 seconds), the SDK will use MAX_BACKOFF for subsequent retries (jitter is still applied to MAX_BACKOFF, though).
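The capping behavior described here can be sketched as follows. This is an illustration under stated assumptions, not the SDK's implementation: `capped_backoff` and its parameters are hypothetical names, and the jitter factor is passed in by the caller (expected in the range 0.0 to 1.0) rather than drawn from a random number generator.

```rust
use std::time::Duration;

/// Hypothetical capped exponential backoff with jitter.
/// Computes initial_backoff * 2^attempt, caps it at max_backoff,
/// then scales the capped value by a jitter factor in [0.0, 1.0].
fn capped_backoff(
    jitter: f64,
    initial_backoff: Duration,
    attempt: u32,
    max_backoff: Duration,
) -> Duration {
    // Treat a non-finite jitter as "no jitter" so mul_f64 cannot panic.
    let jitter = if jitter.is_finite() { jitter.clamp(0.0, 1.0) } else { 1.0 };

    // Any overflow in the exponential term means the cap has been exceeded.
    let exponential = 2_u32
        .checked_pow(attempt)
        .and_then(|factor| initial_backoff.checked_mul(factor))
        .unwrap_or(max_backoff);

    exponential.min(max_backoff).mul_f64(jitter)
}
```

With a 250 ms initial backoff and a 20 second cap, attempt 3 gives 2 seconds before jitter, while attempt 100 stays at the 20 second cap instead of overflowing.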
@ysaito1001 Thanks for your thorough resolution!
The fix is included in release-2024-05-22.
Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.
Describe the bug
The standard retry calculation may panic when it attempts to convert an invalid float into a Duration.
Expected Behavior
No panics. The operation should fail with an error gracefully.
Current Behavior
Reproduction Steps
Here is an example retry configuration that eventually hit panics on get object requests.
Possible Solution
Return an Err if the duration is invalid before attempting to create the Duration at https://github.com/awslabs/aws-sdk-rust/blob/2f2715b3bc47801b3144dfa3413fa683114e7cb4/sdk/aws-smithy-runtime/src/client/retries/strategy/standard.rs#L150
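A minimal sketch of that idea, assuming Rust 1.66 or later, where `Duration::try_from_secs_f64` is available in the standard library. The helper names `checked_backoff` and `backoff_or_max` are hypothetical and do not describe how the SDK resolved the issue.

```rust
use std::time::{Duration, TryFromFloatSecsError};

/// Hypothetical helper: converts a computed backoff in seconds into a
/// Duration, returning an error instead of panicking when the value is
/// negative, NaN, or too large, and capping valid values at max_backoff.
fn checked_backoff(
    backoff_secs: f64,
    max_backoff: Duration,
) -> Result<Duration, TryFromFloatSecsError> {
    Duration::try_from_secs_f64(backoff_secs).map(|d| d.min(max_backoff))
}

/// Hypothetical variant that keeps retrying: any invalid value
/// (including negative or NaN) falls back to max_backoff.
fn backoff_or_max(backoff_secs: f64, max_backoff: Duration) -> Duration {
    checked_backoff(backoff_secs, max_backoff).unwrap_or(max_backoff)
}
```

Returning the error lets the caller fail the operation gracefully, while the fallback variant is one way to keep retrying at the maximum backoff when the computed value is out of range.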
Additional Information/Context
Timeouts typically happen sporadically over time and depend on external factors. I noticed the panics after many requests ran happily through the common code paths. It is difficult to exercise these code paths in real world tests because 99.999 ... % of requests succeed within N attempts and a reasonable time window.
Version
Environment details (OS name and version, etc.)
macOS 14.3
Logs