awslabs / aws-sdk-rust

AWS SDK for the Rust Programming Language
https://awslabs.github.io/aws-sdk-rust/
Apache License 2.0

Hyper on single-threaded runtime with SDK breaks when trying to reuse idle connections #440

Open yujinis opened 2 years ago

yujinis commented 2 years ago

What is the problem?

I wrote a test that repeatedly calls the ElastiCache DescribeCacheClusters API until the cluster becomes available. The call fails on exactly the 52nd repetition when polling at a 10-second interval, but only when run as a test (cargo test). The same code run from the main function (cargo run) does not fail. I can also avoid the error by using a 5-second interval. The difference seems to be whether #[tokio::main] or #[tokio::test] is used. I also tested the DescribeInstances API with the same code structure, and it did not fail. I generated an aws_smithy_http trace log and found only that the API request was sent but no response was logged. I reproduced this on EC2 with both Amazon Linux 2 and Ubuntu 20.04, and in both ap-northeast-1 and us-east-1. I could not find any related issues in tokio or hyper.

Version

aws-sdk-test v0.1.0 (/home/ec2-user/aws-sdk-test)
├── aws-config v0.6.0
│   ├── aws-http v0.6.0
│   │   ├── aws-smithy-http v0.36.0
│   │   │   ├── aws-smithy-types v0.36.0
│   │   ├── aws-smithy-types v0.36.0 (*)
│   │   ├── aws-types v0.6.0
│   │   │   ├── aws-smithy-async v0.36.0
│   │   │   ├── aws-smithy-types v0.36.0 (*)
│   ├── aws-sdk-sso v0.6.0
│   │   ├── aws-endpoint v0.6.0
│   │   │   ├── aws-smithy-http v0.36.0 (*)
│   │   │   ├── aws-types v0.6.0 (*)
│   │   ├── aws-http v0.6.0 (*)
│   │   ├── aws-sig-auth v0.6.0
│   │   │   ├── aws-sigv4 v0.6.0
│   │   │   │   ├── aws-smithy-http v0.36.0 (*)
│   │   │   ├── aws-smithy-http v0.36.0 (*)
│   │   │   ├── aws-types v0.6.0 (*)
│   │   ├── aws-smithy-async v0.36.0 (*)
│   │   ├── aws-smithy-client v0.36.0
│   │   │   ├── aws-smithy-async v0.36.0 (*)
│   │   │   ├── aws-smithy-http v0.36.0 (*)
│   │   │   ├── aws-smithy-http-tower v0.36.0
│   │   │   │   ├── aws-smithy-http v0.36.0 (*)
│   │   │   ├── aws-smithy-types v0.36.0 (*)
│   │   ├── aws-smithy-http v0.36.0 (*)
│   │   ├── aws-smithy-http-tower v0.36.0 (*)
│   │   ├── aws-smithy-json v0.36.0
│   │   │   └── aws-smithy-types v0.36.0 (*)
│   │   ├── aws-smithy-types v0.36.0 (*)
│   │   ├── aws-types v0.6.0 (*)
│   ├── aws-sdk-sts v0.6.0
│   │   ├── aws-endpoint v0.6.0 (*)
│   │   ├── aws-http v0.6.0 (*)
│   │   ├── aws-sig-auth v0.6.0 (*)
│   │   ├── aws-smithy-async v0.36.0 (*)
│   │   ├── aws-smithy-client v0.36.0 (*)
│   │   ├── aws-smithy-http v0.36.0 (*)
│   │   ├── aws-smithy-http-tower v0.36.0 (*)
│   │   ├── aws-smithy-query v0.36.0
│   │   │   ├── aws-smithy-types v0.36.0 (*)
│   │   ├── aws-smithy-types v0.36.0 (*)
│   │   ├── aws-smithy-xml v0.36.0
│   │   ├── aws-types v0.6.0 (*)
│   ├── aws-smithy-async v0.36.0 (*)
│   ├── aws-smithy-client v0.36.0 (*)
│   ├── aws-smithy-http v0.36.0 (*)
│   ├── aws-smithy-http-tower v0.36.0 (*)
│   ├── aws-smithy-json v0.36.0 (*)
│   ├── aws-smithy-types v0.36.0 (*)
│   ├── aws-types v0.6.0 (*)
├── aws-sdk-ec2 v0.6.0
│   ├── aws-endpoint v0.6.0 (*)
│   ├── aws-http v0.6.0 (*)
│   ├── aws-sig-auth v0.6.0 (*)
│   ├── aws-smithy-async v0.36.0 (*)
│   ├── aws-smithy-client v0.36.0 (*)
│   ├── aws-smithy-http v0.36.0 (*)
│   ├── aws-smithy-http-tower v0.36.0 (*)
│   ├── aws-smithy-query v0.36.0 (*)
│   ├── aws-smithy-types v0.36.0 (*)
│   ├── aws-smithy-xml v0.36.0 (*)
│   ├── aws-types v0.6.0 (*)
├── aws-sdk-elasticache v0.6.0
│   ├── aws-endpoint v0.6.0 (*)
│   ├── aws-http v0.6.0 (*)
│   ├── aws-sig-auth v0.6.0 (*)
│   ├── aws-smithy-async v0.36.0 (*)
│   ├── aws-smithy-client v0.36.0 (*)
│   ├── aws-smithy-http v0.36.0 (*)
│   ├── aws-smithy-http-tower v0.36.0 (*)
│   ├── aws-smithy-query v0.36.0 (*)
│   ├── aws-smithy-types v0.36.0 (*)
│   ├── aws-smithy-xml v0.36.0 (*)
│   ├── aws-types v0.6.0 (*)

Platform

[Amazon Linux 2]: Linux ip-172-31-23-85.ap-northeast-1.compute.internal 5.10.82-83.359.amzn2.x86_64 #1 SMP Tue Nov 30 20:47:14 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
[Ubuntu 20.04]: Linux ip-172-31-19-97 5.11.0-1022-aws #23~20.04.1-Ubuntu SMP Mon Nov 15 14:03:19 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

AWS Services

ElastiCache

Description

No response

Logs

Using this : https://github.com/yujinis/aws-sdk-test

Reproduced:

$ RUST_BACKTRACE=full AWS_REGION=ap-northeast-1 RUST_LOG=aws_smithy_http=trace cargo test test_client_elasticache -- --nocapture --test-threads=1

...

# of calls: 52
2022-02-07T15:21:09.371858Z TRACE send_operation{operation="DescribeCacheClusters" service="elasticache"}: aws_smithy_http_tower::dispatch: request=Request { method: POST, uri: https://elasticache.ap-northeast-1.amazonaws.com/, version: HTTP/1.1, headers: {"content-type": "application/x-www-form-urlencoded", "content-length": "78", "user-agent": "aws-sdk-rust/0.6.0 os/linux lang/rust/1.58.1", "x-amz-user-agent": "aws-sdk-rust/0.6.0 api/elasticache/0.6.0 os/linux lang/rust/1.58.1", "x-amz-date": "20220207T152109Z", "authorization": Sensitive, "x-amz-security-token": "IQoJb...(snip)...NlAlA=="}, body: SdkBody { inner: Once(Some(b"Action=DescribeCacheClusters&Version=2015-02-02&CacheClusterId=test-1644246728")), retryable: true } }
Error: Unhandled(DispatchFailure(ConnectorError { err: hyper::Error(IncompleteMessage), kind: Other(Some(TransientError)) }))
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `1`,
 right: `0`: the test returned a termination value with a non-zero status code (1) which indicates a failure', /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/test/src/lib.rs:195:5
stack backtrace:
   0:     0x55a0f66d072c - std::backtrace_rs::backtrace::libunwind::trace::h09f7e4e089375279
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:     0x55a0f66d072c - std::backtrace_rs::backtrace::trace_unsynchronized::h1ec96f1c7087094e
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x55a0f66d072c - std::sys_common::backtrace::_print_fmt::h317b71fc9a5cf964
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/sys_common/backtrace.rs:67:5
   3:     0x55a0f66d072c - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::he3555b48e7dfe7f0
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/sys_common/backtrace.rs:46:22
   4:     0x55a0f66f69fc - core::fmt::write::h513b07ca38f4fb1b
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/fmt/mod.rs:1149:17
   5:     0x55a0f66c8de5 - std::io::Write::write_fmt::haf8c932b52111354
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/io/mod.rs:1697:15
   6:     0x55a0f66d2440 - std::sys_common::backtrace::_print::h195c38364780a303
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/sys_common/backtrace.rs:49:5
   7:     0x55a0f66d2440 - std::sys_common::backtrace::print::hc09dfdea923b6730
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/sys_common/backtrace.rs:36:9
   8:     0x55a0f66d2440 - std::panicking::default_hook::{{closure}}::hb2e38ec0d91046a3
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:211:50
   9:     0x55a0f66d1ff5 - std::panicking::default_hook::h60284635b0ad54a8
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:228:9
  10:     0x55a0f66d2af4 - std::panicking::rust_panic_with_hook::ha677a669fb275654
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:606:17
  11:     0x55a0f66d25d0 - std::panicking::begin_panic_handler::{{closure}}::h976246fb95d93c31
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:502:13
  12:     0x55a0f66d0bd4 - std::sys_common::backtrace::__rust_end_short_backtrace::h38077ee5b7b9f99a
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/sys_common/backtrace.rs:139:18
  13:     0x55a0f66d2539 - rust_begin_unwind
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:498:5
  14:     0x55a0f55b28d1 - core::panicking::panic_fmt::h35f3a62252ba0fd2
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/panicking.rs:107:14
  15:     0x55a0f66f450e - core::panicking::assert_failed_inner::hd6dab456d95c7c08
  16:     0x55a0f6608b3a - core::panicking::assert_failed::h9f5009b1f7161bda
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/panicking.rs:145:5
  17:     0x55a0f564767a - test::assert_test_result::h1d7603af7e67be2e
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/test/src/lib.rs:195:5
  18:     0x55a0f561eb49 - aws_sdk_test::test_client_elasticache::{{closure}}::h9195ac85a721256c
                               at /home/ec2-user/aws-sdk-test/src/main.rs:61:7
  19:     0x55a0f55c272e - core::ops::function::FnOnce::call_once::h29eec468d3525200
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/ops/function.rs:227:5
  20:     0x55a0f5af10f3 - core::ops::function::FnOnce::call_once::hfcb53c700c0bccab
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/ops/function.rs:227:5
  21:     0x55a0f5af10f3 - test::__rust_begin_short_backtrace::hb22db5130c052d3c
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/test/src/lib.rs:585:5
  22:     0x55a0f5aefb74 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::hd63f214a2a81e294
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/alloc/src/boxed.rs:1694:9
  23:     0x55a0f5aefb74 - <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once::h3374a27362b29e43
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/panic/unwind_safe.rs:271:9
  24:     0x55a0f5aefb74 - std::panicking::try::do_call::hc2cb3ed44a599de2
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:406:40
  25:     0x55a0f5aefb74 - std::panicking::try::hb504c62909631bc5
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:370:19
  26:     0x55a0f5aefb74 - std::panic::catch_unwind::hafd1f822b9064982
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panic.rs:133:14
  27:     0x55a0f5aefb74 - test::run_test_in_process::hdc195a7d4539a9ac
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/test/src/lib.rs:608:18
  28:     0x55a0f5aefb74 - test::run_test::run_test_inner::{{closure}}::h54d3d0f4251c62f8
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/test/src/lib.rs:500:39
  29:     0x55a0f5aeeef5 - test::run_test::run_test_inner::h46d82ad8de194200
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/test/src/lib.rs:538:13
  30:     0x55a0f5aed8c9 - test::run_test::hf694fae6b49e853e
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/test/src/lib.rs:572:28
  31:     0x55a0f5ae8252 - test::run_tests::hf829667faf36e368
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/test/src/lib.rs:313:17
  32:     0x55a0f5ad0698 - test::console::run_tests_console::hd19acac6259be16a
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/test/src/console.rs:290:5
  33:     0x55a0f5ae5ee5 - test::test_main::hd74503f4847d7179
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/test/src/lib.rs:124:15
  34:     0x55a0f5ae6ff1 - test::test_main_static::h8c7c9dafc6a8ad5c
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/test/src/lib.rs:143:5
  35:     0x55a0f55dc223 - aws_sdk_test::main::h8c2a537617b230e3
                               at /home/ec2-user/aws-sdk-test/src/main.rs:1:1
  36:     0x55a0f55c276b - core::ops::function::FnOnce::call_once::ha5b1cde2bda70a8c
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/ops/function.rs:227:5
  37:     0x55a0f561282e - std::sys_common::backtrace::__rust_begin_short_backtrace::h9ddd14adfa2b3be5
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/sys_common/backtrace.rs:123:18
  38:     0x55a0f560e871 - std::rt::lang_start::{{closure}}::h287ab8cb4fbc06c6
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/rt.rs:145:18
  39:     0x55a0f66d03bb - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h7e688d7cdfeb7e00
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/ops/function.rs:259:13
  40:     0x55a0f66d03bb - std::panicking::try::do_call::h4be824d2350b44c9
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:406:40
  41:     0x55a0f66d03bb - std::panicking::try::h0a6fc7affbe5088d
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:370:19
  42:     0x55a0f66d03bb - std::panic::catch_unwind::h22c320f732ec805e
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panic.rs:133:14
  43:     0x55a0f66d03bb - std::rt::lang_start_internal::{{closure}}::hd38309c108fe679d
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/rt.rs:128:48
  44:     0x55a0f66d03bb - std::panicking::try::do_call::h8fcaf501f097a28e
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:406:40
  45:     0x55a0f66d03bb - std::panicking::try::h20e906825f98acc1
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:370:19
  46:     0x55a0f66d03bb - std::panic::catch_unwind::h8c5234dc632124ef
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panic.rs:133:14
  47:     0x55a0f66d03bb - std::rt::lang_start_internal::hc4dd8cd3ec4518c2
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/rt.rs:128:20
  48:     0x55a0f560e840 - std::rt::lang_start::hcec16f7f28737922
                               at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/rt.rs:144:17
  49:     0x55a0f55dc24c - main
  50:     0x7fcb36a1913a - __libc_start_main
  51:     0x55a0f55b309a - _start
  52:                0x0 - <unknown>
FAILED

failures:

failures:
    test_client_elasticache

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 1 filtered out; finished in 540.91s

error: test failed, to rerun pass '--bin aws-sdk-test'

Not reproduced:

$ cargo run
$ AWS_REGION=ap-northeast-1 cargo test test_client_ec2

Velfi commented 2 years ago

NOTE: This might be fixed by https://github.com/awslabs/aws-sdk-rust/issues/160

jdisanti commented 2 years ago

I'm able to reproduce the issue with the sample you provided, consistently on request 52 like you said. I also can't reproduce it when it's run in main.

With trace logging enabled, I see:

2022-02-15T00:42:29.094213Z TRACE send_operation{operation="DescribeCacheClusters" service="elasticache"}: aws_smithy_http_tower::dispatch: request=Request { method: POST, uri: https://elasticache.us-east-2.amazonaws.com/, version: HTTP/1.1, headers: {"content-type": "application/x-www-form-urlencoded", "content-length": "78", "user-agent": "aws-sdk-rust/0.6.0 os/linux lang/rust/1.56.1", "x-amz-user-agent": "aws-sdk-rust/0.6.0 api/elasticache/0.6.0 os/linux lang/rust/1.56.1", "x-amz-date": "20220215T004229Z", "authorization": Sensitive, "x-amz-security-token": "redacted"}, body: SdkBody { inner: Once(Some(b"Action=DescribeCacheClusters&Version=2015-02-02&CacheClusterId=test-1644885188")), retryable: true } }
2022-02-15T00:42:29.094356Z TRACE send_operation{operation="DescribeCacheClusters" service="elasticache"}: hyper::client::pool: take? ("https", elasticache.us-east-2.amazonaws.com): expiration = Some(90s)
2022-02-15T00:42:29.094415Z DEBUG send_operation{operation="DescribeCacheClusters" service="elasticache"}: hyper::client::pool: reuse idle connection for ("https", elasticache.us-east-2.amazonaws.com)
2022-02-15T00:42:29.094611Z TRACE encode_headers: hyper::proto::h1::role: Client::encode method=POST, body=Some(Known(78))
2022-02-15T00:42:29.094694Z TRACE hyper::proto::h1::encode: sized write, len = 78
2022-02-15T00:42:29.094716Z TRACE hyper::proto::h1::io: buffer.flatten self.len=1363 buf.len=78
2022-02-15T00:42:29.094829Z DEBUG hyper::proto::h1::io: flushed 1441 bytes
2022-02-15T00:42:29.094850Z TRACE hyper::proto::h1::conn: flushed({role=client}): State { reading: Init, writing: KeepAlive, keep_alive: Busy }
2022-02-15T00:42:29.094971Z TRACE hyper::proto::h1::conn: Conn::read_head
2022-02-15T00:42:29.095031Z TRACE hyper::proto::h1::io: received 0 bytes
2022-02-15T00:42:29.095052Z TRACE hyper::proto::h1::io: parse eof
2022-02-15T00:42:29.095064Z TRACE hyper::proto::h1::conn: State::close_read()
2022-02-15T00:42:29.095075Z DEBUG hyper::proto::h1::conn: parse error (connection closed before message completed) with 0 bytes
2022-02-15T00:42:29.095085Z DEBUG hyper::proto::h1::dispatch: read_head error: connection closed before message completed
2022-02-15T00:42:29.095108Z TRACE hyper::proto::h1::conn: State::close_read()
2022-02-15T00:42:29.095139Z TRACE hyper::proto::h1::conn: State::close_write()
2022-02-15T00:42:29.095156Z TRACE hyper::proto::h1::conn: flushed({role=client}): State { reading: Closed, writing: Closed, keep_alive: Disabled }
2022-02-15T00:42:29.095181Z DEBUG rustls::session: Sending warning alert CloseNotify    
2022-02-15T00:42:29.095308Z TRACE hyper::proto::h1::conn: shut down IO complete
2022-02-15T00:42:29.095341Z TRACE mio::poll: deregistering event source from poller    
2022-02-15T00:42:29.095401Z TRACE want: signal: Closed    
Error: Unhandled(DispatchFailure(ConnectorError { err: hyper::Error(IncompleteMessage), kind: Other(Some(TransientError)) }))
thread 'test_client_elasticache' panicked at 'assertion failed: `(left == right)`
  left: `1`,
 right: `0`: the test returned a termination value with a non-zero status code (1) which indicates a failure', /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/test/src/lib.rs:194:5

Not sure what's causing it yet. Will keep investigating.

jdisanti commented 2 years ago

There appear to be two bugs here:

  1. hyper consistently fails when calling the ElastiCache service on a current_thread Tokio runtime if it reuses a connection that has been idle for 6 seconds. With retry disabled in the SDK, I observe:
    create cluster request: success
    describe request 1: success
    sleep 6 seconds
    describe request 2: fails
  2. The SDK is not replenishing its cross-request retry allowance after a successful response. This is why it always fails on describe request 52: it actually fails on describe requests 2-51 (the ones that reuse an idle connection after the sleep), but each of those is retried successfully. By the time it gets to request 52, the cross-request retry allowance has reached zero, so the next failure isn't retried and we see Unhandled(DispatchFailure(ConnectorError { err: hyper::Error(IncompleteMessage), kind: Other(Some(TransientError)) })). What should happen is that the allowance is replenished a little each time there is a successful response.

I'm working on a fix for 2 and talking with @seanmonstar about 1.
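The arithmetic behind request 52 can be sketched with a small model. This is a hypothetical simulation, not the SDK's code: the `RetryAllowance` type, the starting allowance of 500, the retry cost of 10, and the refill amount are all assumptions, chosen so that 50 retried failures exhaust the allowance as described above.

```rust
/// Hypothetical model of a cross-request retry allowance (not the SDK's type).
struct RetryAllowance {
    tokens: u32,
    capacity: u32,
    retry_cost: u32,     // assumed cost of one retry
    success_refill: u32, // assumed refill per successful response
}

impl RetryAllowance {
    /// Spend tokens for one retry; returns false if the allowance is exhausted.
    fn try_acquire(&mut self) -> bool {
        if self.tokens >= self.retry_cost {
            self.tokens -= self.retry_cost;
            true
        } else {
            false
        }
    }

    /// Give tokens back after a successful response, capped at capacity.
    fn on_success(&mut self) {
        self.tokens = (self.tokens + self.success_refill).min(self.capacity);
    }
}

/// Returns the 1-based index of the first request whose failure cannot be
/// retried, or None if every request eventually succeeds.
fn first_unretried_request(mut allowance: RetryAllowance, total: u32) -> Option<u32> {
    for request in 1..=total {
        // Request 1 uses a fresh connection; every later request first hits
        // a stale idle connection and needs exactly one retry.
        if request > 1 && !allowance.try_acquire() {
            return Some(request);
        }
        allowance.on_success();
    }
    None
}

fn main() {
    let buggy = RetryAllowance { tokens: 500, capacity: 500, retry_cost: 10, success_refill: 0 };
    let fixed = RetryAllowance { tokens: 500, capacity: 500, retry_cost: 10, success_refill: 10 };
    println!("without refill: {:?}", first_unretried_request(buggy, 100));
    println!("with refill:    {:?}", first_unretried_request(fixed, 100));
}
```

Under these assumed numbers, the model without refill stops at request 52, while the model with refill never runs out within 100 requests.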

jdisanti commented 2 years ago

The issue of the SDK not replenishing its cross-request retry allowance was fixed with the release of v0.8.0. This should at least improve the reliability of the SDK since it will now retry these failures correctly.

Next steps are to determine if a FIN message is available to hyper before it attempts to reuse the idle connection. If one is available prior to reuse, then there may be a bug in hyper.
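As a rough illustration of what "a FIN is available before reuse" means at the socket level, the sketch below uses only the Rust standard library on a plain loopback TCP connection. It is not how hyper checks its pool; `peer_closed` and `demo` are hypothetical helpers, and a real HTTPS connection would also involve TLS close_notify data.

```rust
use std::io::{self, ErrorKind};
use std::net::{TcpListener, TcpStream};
use std::thread;
use std::time::Duration;

/// Returns true if the peer has already closed its side of the connection,
/// i.e. a FIN has arrived and a read would report EOF.
fn peer_closed(stream: &TcpStream) -> io::Result<bool> {
    stream.set_nonblocking(true)?;
    let mut probe = [0u8; 1];
    let outcome = match stream.peek(&mut probe) {
        Ok(0) => Ok(true),  // EOF: the peer sent a FIN
        Ok(_) => Ok(false), // unread data is pending; no EOF seen yet
        Err(e) if e.kind() == ErrorKind::WouldBlock => Ok(false), // idle but open
        Err(e) => Err(e),
    };
    stream.set_nonblocking(false)?;
    outcome
}

/// Opens a loopback connection, then closes the server side and checks
/// whether the client can observe the close before trying to reuse it.
fn demo() -> io::Result<(bool, bool)> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let client = TcpStream::connect(listener.local_addr()?)?;
    let (server_side, _) = listener.accept()?;

    let before = peer_closed(&client)?;

    // Simulate the service closing an idle connection.
    drop(server_side);

    // Poll briefly so the FIN has time to arrive.
    let mut after = false;
    for _ in 0..100 {
        if peer_closed(&client)? {
            after = true;
            break;
        }
        thread::sleep(Duration::from_millis(10));
    }
    Ok((before, after))
}

fn main() -> io::Result<()> {
    let (before, after) = demo()?;
    println!("closed before server drop: {}, after: {}", before, after);
    Ok(())
}
```

If a check like this reports the close before the request is written, the client could discard the connection instead of sending on it.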

With retry fixed in v0.8.0, the easiest way to reproduce this is to disable retry entirely and use a current_thread Tokio runtime. With that configuration, it should fail consistently on the first reuse of an idle connection that is 6 seconds old (specifically for aws-sdk-elasticache).

jmklix commented 8 months ago

This SDK has gone GA since this issue was opened. Can you update to the latest version and see if this is still breaking for you?

github-actions[bot] commented 8 months ago

Greetings! It looks like this issue hasn’t been active in longer than a week. We encourage you to check if this is still an issue in the latest release. Because it has been longer than a week since the last update on this, and in the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or add an upvote to prevent automatic closure, or if the issue is already closed, please feel free to open a new one.

kumargu commented 8 months ago

Our application, which connects to S3, ran into similar problems: the S3 connections were stuck indefinitely. I don't have error logs from the SDK to prove the theory, but from the sequence of events I am fairly confident that the client was stuck on S3 connections.

I tried reproducing this on my local computer and did see a flurry of errors like DispatchFailure { source: ConnectorError { kind: Other(Some(TransientError)), source: hyper::Error(IncompleteMessage), connection: Unknown } }. These errors are tagged as TransientError, and it seems they are not always retried (the problem was discussed earlier in this thread).

I can confirm that I am running the latest versions:

aws-config = "*"
aws-sdk-s3 = "*"

Here's a code snippet to repro the issue.

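use std::time::Duration;

use aws_config::meta::region::RegionProviderChain;
use aws_sdk_s3::error::SdkError;
use aws_sdk_s3::{Client, Error};

#[tokio::main]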
async fn main() -> Result<(), Error> {
    let region_provider = RegionProviderChain::default_provider().or_else("us-west-2");
    let config = aws_config::from_env().region(region_provider).load().await;

    let client = Client::new(&config);

    for outer in 0..100 {
        // pause between batches to avoid 503s from S3
        tokio::time::sleep(Duration::from_secs(5)).await;

        let s3_key_name = format!("S3ServerAccessLogs/abc_{}", outer);

        let mut futs = Vec::new();

        // the more iterations, the higher the chance of DispatchFailure errors
        for _ in 0..500 {
            futs.push(tokio::spawn(
                client
                    .put_object()
                    .bucket("<my_bucket>")
                    .key(s3_key_name.clone())
                    .content_type("application/json")
                    .send(),
            ));
        }

        for fut in futs {
            match fut.await {
                Ok(res) => match res {
                    Ok(_res) => {}
                    Err(SdkError::ServiceError(_err)) => {
                        // skip logging
                    }
                    Err(SdkError::DispatchFailure(err)) => {
                        if err.is_timeout() || err.is_io() {
                            println!("DispatchFailure Error is timeout {:?}", err)
                        } else if err.is_other() {
                            println!(
                                "DispatchFailure Error which is retryable failure {:?} {:?}",
                                s3_key_name, err
                            );                
                        } else {
                            println!("DispatchFailure && Unretryable");
                        }
                    }
                    Err(_err) => {
                        // skip logging
                    }
                },
                Err(join_err) => {
                    println!("Join error {}", join_err)
                }
            }
        }
        println!("Object uploaded successfully {} ", s3_key_name);
    }

    Ok(())
}

kumargu commented 8 months ago

ping

landonxjames commented 3 months ago

I was unable to reproduce this issue on my laptop (M1 Mac) using the code snippet above. Has this been resolved for you as well, @kumargu? If not, it could be related to OS-imposed resource limits that I am not hitting.

kumargu commented 3 months ago

I will try to repro this again. However, I can confirm that the issue still exists for us in production.

kumargu commented 3 months ago

We spoke to @Velfi internally, and it was suspected that this could be due to a conflict between the hyper client's default idle timeout (90 seconds) and S3's (20 seconds). To rule it out, we disabled SDK retries, but that hasn't helped either.
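The suspected timeout mismatch can be pictured as a window in which the client still treats a pooled connection as reusable even though the server has already closed it. The sketch below is a simplified, hypothetical model: `IdleState` and `classify_idle` are illustrative names, the 90 and 20 second values are the ones quoted above, and real servers may close connections at other times.

```rust
use std::time::Duration;

/// How a pooled connection looks after sitting idle for some time.
#[derive(Debug, PartialEq)]
enum IdleState {
    /// Both sides still consider the connection open.
    Usable,
    /// The server has timed it out, but the client pool would still reuse it.
    Stale,
    /// The client pool has expired it, so it is never reused.
    Evicted,
}

/// Classifies an idle connection under a simple two-timeout model.
fn classify_idle(idle: Duration, client_timeout: Duration, server_timeout: Duration) -> IdleState {
    if idle >= client_timeout {
        IdleState::Evicted
    } else if idle >= server_timeout {
        IdleState::Stale
    } else {
        IdleState::Usable
    }
}

fn main() {
    let client_timeout = Duration::from_secs(90); // pool expiration quoted above
    let server_timeout = Duration::from_secs(20); // S3 figure quoted above
    for secs in [5u64, 30, 120] {
        let state = classify_idle(Duration::from_secs(secs), client_timeout, server_timeout);
        println!("idle for {}s: {:?}", secs, state);
    }
}
```

In this model, any connection idle for between 20 and 90 seconds falls into the stale window, which is where an IncompleteMessage error on reuse would be expected.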

ssenchenko commented 1 month ago

The bug still happens sporadically in prod. The issue is not gone. Any plans to look into it?

Velfi commented 3 weeks ago

@ssenchenko Can you provide an example that reproduces this issue?