awslabs / aws-sdk-rust

AWS SDK for the Rust Programming Language
https://awslabs.github.io/aws-sdk-rust/
Apache License 2.0
2.98k stars 246 forks source link

Standard retry backoff calculation may panic #1133

Closed trueb2 closed 4 months ago

trueb2 commented 5 months ago

Describe the bug

The standard retry calculation may cause panic by attempting to convert an invalid float into a Duration.

Expected Behavior

No panics. The operation should fail with an error gracefully.

Current Behavior

thread 'actix-server worker 7' panicked at /rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library/core/src/time.rs:762:23:
can not convert float seconds to Duration: value is either too big or NaN
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::time::Duration::from_secs_f64::panic_cold_display
   3: <aws_smithy_runtime::client::retries::strategy::standard::StandardRetryStrategy as aws_smithy_runtime_api::client::retries::RetryStrategy>::should_attempt_retry
   4: aws_smithy_runtime::client::orchestrator::try_op::{{closure}}::{{closure}}
   5: <aws_smithy_async::future::timeout::Timeout<T,S> as core::future::future::Future>::poll
   6: <aws_smithy_runtime::client::timeout::MaybeTimeoutFuture<InnerFuture> as core::future::future::Future>::poll
   7: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
   8: aws_smithy_runtime::client::orchestrator::invoke_with_stop_point::{{closure}}
   9: aws_sdk_s3::operation::get_object::builders::GetObjectFluentBuilder::send::{{closure}}

Reproduction Steps

Here is an example retry configuration that eventually hit panics on get object requests.

aws-config = { version = "1.1.3", features = ["behavior-version-latest"] }
aws-sdk-s3 = "1.13.0"
      let config: SdkConfig = aws_config::from_env()
          .region(region)
          .endpoint_url(host.clone())
          .retry_config(
              RetryConfig::standard()
                  .with_initial_backoff(Duration::from_millis(1))
                  .with_max_attempts(100)
                  .with_reconnect_mode(
                      aws_sdk_s3::config::retry::ReconnectMode::ReconnectOnTransientError,
                  ),
          )
          .timeout_config(
              TimeoutConfig::builder()
                  .operation_attempt_timeout(Duration::from_secs(180))
                  .operation_timeout(Duration::from_secs(600))
                  .build(),
          )
          .load()
          .await;

      let s3_client: Client = Client::new(&config);

Possible Solution

Err if the duration is invalid before attempting to create the Duration at https://github.com/awslabs/aws-sdk-rust/blob/2f2715b3bc47801b3144dfa3413fa683114e7cb4/sdk/aws-smithy-runtime/src/client/retries/strategy/standard.rs#L150

Additional Information/Context

Timeouts typically happen sporadically over time and depend on external factors. I noticed the panics after many requests ran happily through the common code paths. It is difficult to exercise these code paths in real world tests because 99.999 ... % of requests succeed within N attempts and a reasonable time window.

Version

├── aws-config v1.1.4
│   ├── aws-credential-types v1.1.4
│   │   ├── aws-smithy-async v1.1.4
│   │   ├── aws-smithy-runtime-api v1.1.4
│   │   │   ├── aws-smithy-async v1.1.4 (*)
│   │   │   ├── aws-smithy-types v1.1.4
│   │   ├── aws-smithy-types v1.1.4 (*)
│   ├── aws-runtime v1.1.4
│   │   ├── aws-credential-types v1.1.4 (*)
│   │   ├── aws-sigv4 v1.1.4
│   │   │   ├── aws-credential-types v1.1.4 (*)
│   │   │   ├── aws-smithy-eventstream v0.60.4
│   │   │   │   ├── aws-smithy-types v1.1.4 (*)
│   │   │   ├── aws-smithy-http v0.60.4
│   │   │   │   ├── aws-smithy-eventstream v0.60.4 (*)
│   │   │   │   ├── aws-smithy-runtime-api v1.1.4 (*)
│   │   │   │   ├── aws-smithy-types v1.1.4 (*)
│   │   │   ├── aws-smithy-runtime-api v1.1.4 (*)
│   │   │   ├── aws-smithy-types v1.1.4 (*)
│   │   ├── aws-smithy-async v1.1.4 (*)
│   │   ├── aws-smithy-eventstream v0.60.4 (*)
│   │   ├── aws-smithy-http v0.60.4 (*)
│   │   ├── aws-smithy-runtime-api v1.1.4 (*)
│   │   ├── aws-smithy-types v1.1.4 (*)
│   │   ├── aws-types v1.1.4
│   │   │   ├── aws-credential-types v1.1.4 (*)
│   │   │   ├── aws-smithy-async v1.1.4 (*)
│   │   │   ├── aws-smithy-runtime-api v1.1.4 (*)
│   │   │   ├── aws-smithy-types v1.1.4 (*)
│   ├── aws-sdk-sso v1.12.0
│   │   ├── aws-credential-types v1.1.4 (*)
│   │   ├── aws-runtime v1.1.4 (*)
│   │   ├── aws-smithy-async v1.1.4 (*)
│   │   ├── aws-smithy-http v0.60.4 (*)
│   │   ├── aws-smithy-json v0.60.4
│   │   │   └── aws-smithy-types v1.1.4 (*)
│   │   ├── aws-smithy-runtime v1.1.4
│   │   │   ├── aws-smithy-async v1.1.4 (*)
│   │   │   ├── aws-smithy-http v0.60.4 (*)
│   │   │   ├── aws-smithy-runtime-api v1.1.4 (*)
│   │   │   ├── aws-smithy-types v1.1.4 (*)
│   │   ├── aws-smithy-runtime-api v1.1.4 (*)
│   │   ├── aws-smithy-types v1.1.4 (*)
│   │   ├── aws-types v1.1.4 (*)
│   ├── aws-sdk-ssooidc v1.12.0
│   │   ├── aws-credential-types v1.1.4 (*)
│   │   ├── aws-runtime v1.1.4 (*)
│   │   ├── aws-smithy-async v1.1.4 (*)
│   │   ├── aws-smithy-http v0.60.4 (*)
│   │   ├── aws-smithy-json v0.60.4 (*)
│   │   ├── aws-smithy-runtime v1.1.4 (*)
│   │   ├── aws-smithy-runtime-api v1.1.4 (*)
│   │   ├── aws-smithy-types v1.1.4 (*)
│   │   ├── aws-types v1.1.4 (*)
│   ├── aws-sdk-sts v1.12.0
│   │   ├── aws-credential-types v1.1.4 (*)
│   │   ├── aws-runtime v1.1.4 (*)
│   │   ├── aws-smithy-async v1.1.4 (*)
│   │   ├── aws-smithy-http v0.60.4 (*)
│   │   ├── aws-smithy-json v0.60.4 (*)
│   │   ├── aws-smithy-query v0.60.4
│   │   │   ├── aws-smithy-types v1.1.4 (*)
│   │   ├── aws-smithy-runtime v1.1.4 (*)
│   │   ├── aws-smithy-runtime-api v1.1.4 (*)
│   │   ├── aws-smithy-types v1.1.4 (*)
│   │   ├── aws-smithy-xml v0.60.4
│   │   ├── aws-types v1.1.4 (*)
│   ├── aws-smithy-async v1.1.4 (*)
│   ├── aws-smithy-http v0.60.4 (*)
│   ├── aws-smithy-json v0.60.4 (*)
│   ├── aws-smithy-runtime v1.1.4 (*)
│   ├── aws-smithy-runtime-api v1.1.4 (*)
│   ├── aws-smithy-types v1.1.4 (*)
│   ├── aws-types v1.1.4 (*)
├── aws-sdk-s3 v1.14.0
│   ├── aws-credential-types v1.1.4 (*)
│   ├── aws-runtime v1.1.4 (*)
│   ├── aws-sigv4 v1.1.4 (*)
│   ├── aws-smithy-async v1.1.4 (*)
│   ├── aws-smithy-checksums v0.60.4
│   │   ├── aws-smithy-http v0.60.4 (*)
│   │   ├── aws-smithy-types v1.1.4 (*)
│   ├── aws-smithy-eventstream v0.60.4 (*)
│   ├── aws-smithy-http v0.60.4 (*)
│   ├── aws-smithy-json v0.60.4 (*)
│   ├── aws-smithy-runtime v1.1.4 (*)
│   ├── aws-smithy-runtime-api v1.1.4 (*)
│   ├── aws-smithy-types v1.1.4 (*)
│   ├── aws-smithy-xml v0.60.4 (*)
│   ├── aws-types v1.1.4 (*)
│   ├── aws-config v1.1.4 (*)
│   ├── aws-sdk-s3 v1.14.0 (*)

Environment details (OS name and version, etc.)

macOS 14.3

Logs

thread 'actix-server worker 7' panicked at /rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library/core/src/time.rs:762:23:
can not convert float seconds to Duration: value is either too big or NaN
stack backtrace:
   2: core::time::Duration::from_secs_f64::panic_cold_display
   3: <aws_smithy_runtime::client::retries::strategy::standard::StandardRetryStrategy as aws_smithy_runtime_api::client::retries::RetryStrategy>::should_attempt_retry
trueb2 commented 5 months ago

I looked into this a little more. If too many attempts are made, then the backoff will be f64::MAX from

https://github.com/awslabs/aws-sdk-rust/blob/main/sdk/aws-smithy-runtime/src/client/retries/strategy/standard.rs#L291

f64::MAX then is used immediately causing the panic. The work around for now is to never specify attempts and initial backoffs that could result in f64::MAX calculated.

jdisanti commented 5 months ago

Thank you for filing this issue.

ysaito1001 commented 4 months ago

Hi @trueb2, we're working on a fix to address this panic issue. In the meantime, we have a question for your use case.

In general, specifying the number of retry attempts to be 100 does not seem to be a normal workflow. What is the motivation behind this number? Is that out of necessity or are you simply testing the SDK's behavior to see what happens with the number of attempts being 100?

trueb2 commented 4 months ago

The use case for this is that there can be network failures or outages that are random but eventually should succeed. So the initial retry duration should be very low and many reattempts should be possible. Choosing the appropriate maximum input values that do not induce panic is opaque, so 100 seems a reasonable number of max attempts given that the max backoff is an input and the total operation timeout can bounded by an absolute maximum duration. Exponential backoff does not make sense after a certain number of backoffs, but the other parameters exposed by the SDK should make it possible to avoid panicking or unreasonable delays.

I did find that the high retry attempt limits were out of necessity as I have hit the maximum backoff limit causing this panic multiple times over the last month. I was not testing the SDK's behavior.

ysaito1001 commented 4 months ago

Thank you for your response. https://github.com/smithy-lang/smithy-rs/pull/3621 will fix a panic in a way that it should still allow you to retry the desired number of times.

Exponential backoff does not make sense after a certain number of backoffs,

Correct, once an exponential backoff duration becomes large enough to exceed MAX_BACKOFF (by default 20 seconds), the SDK will use that duration for subsequent retries (jitter is applied to MAX_BACKOFF though).

trueb2 commented 4 months ago

@ysaito1001 Thanks for your thorough resolution!

ysaito1001 commented 4 months ago

The fix is included in release-2024-05-22.

github-actions[bot] commented 4 months ago

Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.