Status: Open. xizhem opened this issue 2 months ago.
Hey @xizhem, thanks for submitting this issue. I've added it to our backlog.
> I found that connection created by SDK client to S3 does not have keep alive header

To clarify, the SDK does not alter the keep-alive header, so this must be happening somewhere else.

> Does SDK uses HTTP1 or 2? In hyper documentation, keep-alive looks like enabled by default? https://docs.rs/hyper/latest/hyper/server/conn/http1/struct.Builder.html#method.keep_alive

The SDK uses hyper's defaults: HTTP/1.1, negotiating up to HTTP/2 if the server wishes to do so.

> keep-alive looks like enabled by default?

It is.
Describe the bug
Recently, after upgrading the SDK to aws-sdk-s3 (>1.14), we noticed an upward trend of S3 GET timeout errors in production. We have already ruled out the issue from #1118. In our case, the error message is TransientError, caused by hitting the attempt timeout. The number of errors correlates with the connection timeout setting, and also with the load that we send to S3.
Expected Behavior
Our timeout settings are as follows:
connection timeout: default (3.1s)
attempt timeout: 800ms
operation timeout: 2.6s
total attempts: 3
We expect the S3 request to succeed within this 2.6s.
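To make the time budget implied by these settings explicit, here is a minimal sketch using only the Rust standard library. It is not SDK code: the helper names `backoff_budget` and `connect_timeout_is_masked` are hypothetical. It also assumes that the attempt timeout bounds a whole attempt, connection establishment included, and that the default connection timeout is 3.1 seconds.

```rust
use std::time::Duration;

/// Time left in the operation budget for backoff delays if every attempt
/// runs all the way to its attempt timeout.
/// Returns `None` when the attempts alone already exceed the operation timeout.
/// (Hypothetical helper for illustration, not part of the SDK.)
fn backoff_budget(
    attempt_timeout: Duration,
    max_attempts: u32,
    operation_timeout: Duration,
) -> Option<Duration> {
    // Worst case: each attempt consumes its full timeout.
    let worst_case = attempt_timeout.checked_mul(max_attempts)?;
    operation_timeout.checked_sub(worst_case)
}

/// True when the connect timeout can never fire first, because the attempt
/// timeout is shorter or equal (assuming the attempt timeout also covers
/// connection establishment).
fn connect_timeout_is_masked(connect_timeout: Duration, attempt_timeout: Duration) -> bool {
    connect_timeout >= attempt_timeout
}
```

With the values above, three attempts of 800ms use up to 2.4s, which leaves 200ms of the 2.6s operation timeout for all backoff delays combined. Under the stated assumption, a 3.1s connection timeout also cannot fire before the 800ms attempt timeout does, so a slow connect would be reported as an attempt timeout.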
Current Behavior
We verified that the SDK did retry 3 times, but the request still timed out after all 3 attempts were exhausted.
The Smithy orchestrator typically emits a halting log line before the TransientError. We couldn't tell whether the connection was established successfully within those 800ms, as there is no shared identifier between the hyper logs and the SDK logs.
Reproduction Steps
We ran a load test to benchmark the S3 client and found a correlation between the connection timeout setting and the number of timeout errors. The load test runs at a maximum concurrency of 200 S3 GETs.
Possible Solution
Using the Linux ss command to get an overview of the sockets, I found that connections created by the SDK client to S3 do not have the keep-alive header. Note that 3.5.87.213:https is an S3 host, as I checked from here.
Whereas a typical connection could look like this; notice the keep-alive header:
I suspect this issue is due to inefficient connection reuse down at the hyper layer, i.e. previously active connections are closed by S3 at random because the header is missing. But I could be wrong.
We also observe that this log line appears consistently before the transient error:
Additional Information/Context
No response
Version