awslabs / aws-sdk-rust

AWS SDK for the Rust Programming Language
https://awslabs.github.io/aws-sdk-rust/
Apache License 2.0
2.91k stars 245 forks source link

Upticks of S3 timeout errors after SDK upgrade #1130

Open xizhem opened 2 months ago

xizhem commented 2 months ago

Describe the bug

Recently after SDK upgrade to aws-s3-sdk(>1.14), we notice upward trend of S3 GET timeout errors in production. We already ruled out the issue from #1118 . In our case, the error message is TransientError due to hitting attempt timeout.

There are correlation with connection timeout setting with the number of errors we've seen. There are also correlation with the load that we send to S3 to the number of errors.

Expected Behavior

Our timeout setting is as follow:

connection timeout: default to 3.1 attempt timeout: 800ms operation timeout: 2.6s total attempts: 3

We expect S3 request to success during this 2.6s.

Current Behavior

SDK did retry 3 times as we check. But still, we timeout after 3 attempts exhausted.

2024-04-16T22:08:15.844Z DEBUG aws_smithy_runtime::client::retries::strategy::standard: attempt #1 failed with RetryIndicated(RetryableError { kind: TransientError, retry_after: None }); retrying after 47.57774ms
...(backoff 47.57 then 800ms after)
2024-04-16T22:08:16.693Z DEBUG aws_smithy_runtime::client::retries::strategy::standard: attempt #2 failed with RetryIndicated(RetryableError { kind: TransientError, retry_after: None }); retrying after 34.98639ms
...(backoff 34.98 then 800ms after)
2024-04-16T22:08:17.530Z DEBUG aws_smithy_runtime::client::retries::strategy::standard: not retrying because we are out of attempts attempts=3 max_attempts=3
2024-04-16T22:08:17.530Z ERROR {redacted} GetObject request failed for key "{redacted}" with error TimeoutError(TimeoutError { source: MaybeTimeoutError { kind: OperationAttempt, duration: 800ms } })
2024-04-16T22:08:17.530Z ERROR {redacted} Get Object request failed: "request has timed out"

Smithy orchestrator typically emit halting line before the TransientError. We couldn't tell whether connection is established successfully within these 800ms or not, as there is no identifier between hyper logs vs. SDK logs.

2024-04-11T21:26:03.775Z DEBUG invoke{service=s3 operation=GetObject}:try_op:try_attempt: aws_smithy_runtime::client::orchestrator: encountered orchestrator error; halting

Reproduction Steps

We ran load test to benchmark S3 client and we found correlation between connection timeout and Timeout errors. The load test is running at max possible of 200 concurrency of S3 gets.

Run 1: 50ms connection timeout

Total errors: 4697.000
Total transactions: 4896.000, Successful S3 transactions 199

Run 2: 400ms connection timeout

Total errors: 3969.000
Total transactions: 9718.000, Successful S3 transactions 5701

Run 3: 1500ms connection timeout

Total errors: 3000.000
Total transactions: 42531.000, Successful S3 transactions 38419

Run 4: 2000ms connection timeout

Total errors: 3201.000
Total transactions: 33302.000,  Successful S3 transactions 29478

Run 5: 3000ms connection timeout

Total errors: 0.000
Total transactions: 128115.000, Successful S3 transactions 118430

Possible Solution

By using linux ss command to observer socket overview. I found that connection created by SDK client to S3 does not have keep alive header. Note 3.5.87.213:https is s3 host as I check from here

ESTAB      0      0       [redacted_ip]:[redacted_port]     3.5.87.213:https users:(("[redacted_user]",pid=3000417,fd=73)) uid:4001 ino:277212367 sk:2b012 cgroup:unreachable:1 <->

whereas a typical connection could look like this, notice the keep alive header:

ESTAB      0      0         [redacted_ip]:[redacted_port]     [redacted_ip]:[redacted_port]    users:(("[redacted_user]",pid=12886,fd=397)) timer:(keepalive,1.580ms,0) uid:99 ino:407978 sk:161ea cgroup:unreachable:1 <->

I suspect this issue is due to inefficient usage of connection reuse down at hyper layer, i.e. previous active connection are closed by S3 randomly due to the lack of header. But I could be wrong.

We also observe this log line appears consistently before the transient error

State { reading: Init, writing: KeepAlive, keep_alive: Busy }

Additional Information/Context

No response

Version

├── aws-config v1.1.10
│   ├── aws-credential-types v1.1.8
│   │   ├── aws-smithy-async v1.2.1
│   │   ├── aws-smithy-runtime-api v1.3.0
│   │   │   ├── aws-smithy-async v1.2.1 (*)
│   │   │   ├── aws-smithy-types v1.1.8
│   │   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-runtime v1.1.9
│   │   ├── aws-credential-types v1.1.8 (*)
│   │   ├── aws-sigv4 v1.2.0
│   │   │   ├── aws-credential-types v1.1.8 (*)
│   │   │   ├── aws-smithy-eventstream v0.60.4
│   │   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   │   ├── aws-smithy-http v0.60.7
│   │   │   │   ├── aws-smithy-eventstream v0.60.4 (*)
│   │   │   │   ├── aws-smithy-runtime-api v1.3.0 (*)
│   │   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   │   ├── aws-smithy-runtime-api v1.3.0 (*)
│   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-async v1.2.1 (*)
│   │   ├── aws-smithy-eventstream v0.60.4 (*)
│   │   ├── aws-smithy-http v0.60.7 (*)
│   │   ├── aws-smithy-runtime-api v1.3.0 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-types v1.1.9
│   │   │   ├── aws-credential-types v1.1.8 (*)
│   │   │   ├── aws-smithy-async v1.2.1 (*)
│   │   │   ├── aws-smithy-runtime-api v1.3.0 (*)
│   │   │   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-sdk-sso v1.19.0
│   │   ├── aws-credential-types v1.1.8 (*)
│   │   ├── aws-runtime v1.1.9 (*)
│   │   ├── aws-smithy-async v1.2.1 (*)
│   │   ├── aws-smithy-http v0.60.7 (*)
│   │   ├── aws-smithy-json v0.60.7
│   │   │   └── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-runtime v1.2.1
│   │   │   ├── aws-smithy-async v1.2.1 (*)
│   │   │   ├── aws-smithy-http v0.60.7 (*)
│   │   │   ├── aws-smithy-runtime-api v1.3.0 (*)
│   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-runtime-api v1.3.0 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-types v1.1.9 (*)
│   ├── aws-sdk-ssooidc v1.19.0
│   │   ├── aws-credential-types v1.1.8 (*)
│   │   ├── aws-runtime v1.1.9 (*)
│   │   ├── aws-smithy-async v1.2.1 (*)
│   │   ├── aws-smithy-http v0.60.7 (*)
│   │   ├── aws-smithy-json v0.60.7 (*)
│   │   ├── aws-smithy-runtime v1.2.1 (*)
│   │   ├── aws-smithy-runtime-api v1.3.0 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-types v1.1.9 (*)
│   ├── aws-sdk-sts v1.19.0
│   │   ├── aws-credential-types v1.1.8 (*)
│   │   ├── aws-runtime v1.1.9 (*)
│   │   ├── aws-smithy-async v1.2.1 (*)
│   │   ├── aws-smithy-http v0.60.7 (*)
│   │   ├── aws-smithy-json v0.60.7 (*)
│   │   ├── aws-smithy-query v0.60.7
│   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-runtime v1.2.1 (*)
│   │   ├── aws-smithy-runtime-api v1.3.0 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-xml v0.60.7
│   │   ├── aws-types v1.1.9 (*)
│   ├── aws-smithy-async v1.2.1 (*)
│   ├── aws-smithy-http v0.60.7 (*)
│   ├── aws-smithy-json v0.60.7 (*)
│   ├── aws-smithy-runtime v1.2.1 (*)
│   ├── aws-smithy-runtime-api v1.3.0 (*)
│   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-types v1.1.9 (*)
├── aws-credential-types v1.1.8 (*)
├── aws-lc-rs v1.6.4
│   ├── aws-lc-sys v0.14.1
├── aws-sdk-s3 v1.14.0
│   ├── aws-credential-types v1.1.8 (*)
│   ├── aws-runtime v1.1.9 (*)
│   ├── aws-sigv4 v1.2.0 (*)
│   ├── aws-smithy-async v1.2.1 (*)
│   ├── aws-smithy-checksums v0.60.7
│   │   ├── aws-smithy-http v0.60.7 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-smithy-eventstream v0.60.4 (*)
│   ├── aws-smithy-http v0.60.7 (*)
│   ├── aws-smithy-json v0.60.7 (*)
│   ├── aws-smithy-runtime v1.2.1 (*)
│   ├── aws-smithy-runtime-api v1.3.0 (*)
│   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-smithy-xml v0.60.7 (*)
│   ├── aws-types v1.1.9 (*)
├── aws-sdk-sts v1.19.0 (*)
├── aws-smithy-async v1.2.1 (*)
├── aws-smithy-runtime-api v1.3.0 (*)
├── aws-types v1.1.9 (*)
├── aws-config v1.1.10 (*)
├── aws-sdk-s3 v1.14.0 (*)
├── aws-sdk-sts v1.19.0 (*)
├── aws-types v1.1.9 (*)


### Environment details (OS name and version, etc.)

Linux 5.10.210-178.852.amzn2int.x86_64 #1 SMP Tue Feb 27 17:09:26 UTC 2024 x86_64 GNU/Linux

### Logs

For complete tracing logs, please see internal thread: https://amzn-aws.slack.com/archives/C0188A52Z7X/p1712868708040879?thread_ts=1710278104.906559&cid=C0188A52Z7X 
Velfi commented 2 months ago

Hey @xizhem, thanks for submitting this issue. I've added it to our backlog.

ysaito1001 commented 2 months ago

I found that connection created by SDK client to S3 does not have keep alive header

To clarify, the SDK does not alter the keep-alive in the header, so this must've been happening somewhere else.

xizhem commented 2 months ago

Does SDK uses HTTP1 or 2? In hyper documentation, keep-alive looks like enabled by default? https://docs.rs/hyper/latest/hyper/server/conn/http1/struct.Builder.html#method.keep_alive

ysaito1001 commented 2 months ago

SDK uses hyper defaults, which is HTTP/1.1 and negotiates to HTTP/2 if the servers wishes to do so.

keep-alive looks like enabled by default?

It is.