Open · yujinis opened this issue 2 years ago
NOTE: This might be fixed by https://github.com/awslabs/aws-sdk-rust/issues/160
I'm able to reproduce the issue with the sample you provided, consistently on request 52 like you said. I also can't reproduce it when it's run in main.
With trace logging enabled, I see:
2022-02-15T00:42:29.094213Z TRACE send_operation{operation="DescribeCacheClusters" service="elasticache"}: aws_smithy_http_tower::dispatch: request=Request { method: POST, uri: https://elasticache.us-east-2.amazonaws.com/, version: HTTP/1.1, headers: {"content-type": "application/x-www-form-urlencoded", "content-length": "78", "user-agent": "aws-sdk-rust/0.6.0 os/linux lang/rust/1.56.1", "x-amz-user-agent": "aws-sdk-rust/0.6.0 api/elasticache/0.6.0 os/linux lang/rust/1.56.1", "x-amz-date": "20220215T004229Z", "authorization": Sensitive, "x-amz-security-token": "redacted"}, body: SdkBody { inner: Once(Some(b"Action=DescribeCacheClusters&Version=2015-02-02&CacheClusterId=test-1644885188")), retryable: true } }
2022-02-15T00:42:29.094356Z TRACE send_operation{operation="DescribeCacheClusters" service="elasticache"}: hyper::client::pool: take? ("https", elasticache.us-east-2.amazonaws.com): expiration = Some(90s)
2022-02-15T00:42:29.094415Z DEBUG send_operation{operation="DescribeCacheClusters" service="elasticache"}: hyper::client::pool: reuse idle connection for ("https", elasticache.us-east-2.amazonaws.com)
2022-02-15T00:42:29.094611Z TRACE encode_headers: hyper::proto::h1::role: Client::encode method=POST, body=Some(Known(78))
2022-02-15T00:42:29.094694Z TRACE hyper::proto::h1::encode: sized write, len = 78
2022-02-15T00:42:29.094716Z TRACE hyper::proto::h1::io: buffer.flatten self.len=1363 buf.len=78
2022-02-15T00:42:29.094829Z DEBUG hyper::proto::h1::io: flushed 1441 bytes
2022-02-15T00:42:29.094850Z TRACE hyper::proto::h1::conn: flushed({role=client}): State { reading: Init, writing: KeepAlive, keep_alive: Busy }
2022-02-15T00:42:29.094971Z TRACE hyper::proto::h1::conn: Conn::read_head
2022-02-15T00:42:29.095031Z TRACE hyper::proto::h1::io: received 0 bytes
2022-02-15T00:42:29.095052Z TRACE hyper::proto::h1::io: parse eof
2022-02-15T00:42:29.095064Z TRACE hyper::proto::h1::conn: State::close_read()
2022-02-15T00:42:29.095075Z DEBUG hyper::proto::h1::conn: parse error (connection closed before message completed) with 0 bytes
2022-02-15T00:42:29.095085Z DEBUG hyper::proto::h1::dispatch: read_head error: connection closed before message completed
2022-02-15T00:42:29.095108Z TRACE hyper::proto::h1::conn: State::close_read()
2022-02-15T00:42:29.095139Z TRACE hyper::proto::h1::conn: State::close_write()
2022-02-15T00:42:29.095156Z TRACE hyper::proto::h1::conn: flushed({role=client}): State { reading: Closed, writing: Closed, keep_alive: Disabled }
2022-02-15T00:42:29.095181Z DEBUG rustls::session: Sending warning alert CloseNotify
2022-02-15T00:42:29.095308Z TRACE hyper::proto::h1::conn: shut down IO complete
2022-02-15T00:42:29.095341Z TRACE mio::poll: deregistering event source from poller
2022-02-15T00:42:29.095401Z TRACE want: signal: Closed
Error: Unhandled(DispatchFailure(ConnectorError { err: hyper::Error(IncompleteMessage), kind: Other(Some(TransientError)) }))
thread 'test_client_elasticache' panicked at 'assertion failed: `(left == right)`
left: `1`,
right: `0`: the test returned a termination value with a non-zero status code (1) which indicates a failure', /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/test/src/lib.rs:194:5
Not sure what's causing it yet. Will keep investigating.
There appear to be two bugs here:
1. hyper consistently fails calling the ElastiCache service when used with a current_thread Tokio runtime and a connection is reused after a 6 second delay. If I disable retry in the SDK, I'm observing:
create cluster request: success
describe request 1: success
sleep 6 seconds
describe request 2: fails
sleep
...
2. The SDK isn't replenishing its cross-request retry allowance. Each failed describe request (caused by bug 1) is successfully retried, but by the time it gets to request 52 the cross-request retry allowance reaches zero, so the next failure isn't retried and we see the Unhandled(DispatchFailure(ConnectorError { err: hyper::Error(IncompleteMessage), kind: Other(Some(TransientError)) })). What should be happening is that the allowance is replenished a little bit each time there is a successful response.
I'm working on a fix for 2 and talking with @seanmonstar about 1.
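For context, the cross-request retry allowance behaves like a shared token bucket: each retry withdraws from it, and each successful response should pay a small amount back in. A rough standalone sketch of that idea (illustrative only, not the SDK's actual implementation; the names and costs are made up):

// Illustrative token-bucket model of a cross-request retry allowance.
// Not the SDK's real code; field names and numbers are hypothetical.
struct RetryAllowance {
    capacity: u32,          // maximum budget
    available: u32,         // current budget, shared across requests
    retry_cost: u32,        // withdrawn per retry attempt
    refill_on_success: u32, // paid back per successful response
}

impl RetryAllowance {
    // Returns true if a retry may be attempted, false if the budget is
    // exhausted (in which case the failure surfaces to the caller, as
    // happened here at request 52).
    fn try_acquire_retry(&mut self) -> bool {
        if self.available >= self.retry_cost {
            self.available -= self.retry_cost;
            true
        } else {
            false
        }
    }

    // The piece that was missing before v0.8.0: replenish a little on every success.
    fn record_success(&mut self) {
        self.available = (self.available + self.refill_on_success).min(self.capacity);
    }
}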
The issue of the SDK not replenishing its cross-request retry allowance was fixed with the release of v0.8.0. This should at least improve the reliability of the SDK, since it will now retry these failures correctly.
Next steps are to determine whether a FIN message is available to hyper before it attempts to reuse the idle connection. If one is available prior to reuse, then there may be a bug in hyper.
With retry fixed in v0.8.0, the easiest way to reproduce this is to disable retry entirely and use a current_thread Tokio runtime. It should fail consistently with that configuration on the first reuse of an idle connection that is 6 seconds old (specifically for aws-sdk-elasticache).
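For reference, a minimal sketch of that reproduction setup (not the exact test harness from the linked repo; the RetryConfig import path and the disabled() constructor may differ between SDK versions):

// Sketch: retries disabled + current_thread runtime, as described above.
// Assumes aws-config and aws-sdk-elasticache; APIs may vary by SDK version.
use std::time::Duration;
use aws_sdk_elasticache::Client;
use aws_smithy_types::retry::RetryConfig; // assumed import path

fn main() {
    // #[tokio::test] runs on a current_thread runtime by default; build one explicitly here.
    let rt = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .unwrap();

    rt.block_on(async {
        let config = aws_config::from_env()
            .retry_config(RetryConfig::disabled()) // assumption: available in this SDK version
            .load()
            .await;
        let client = Client::new(&config);

        // First call opens a fresh connection and succeeds.
        let _ = client.describe_cache_clusters().send().await;

        // Waiting more than 6 seconds and calling again reuses the idle
        // connection, which is where the IncompleteMessage error appears.
        tokio::time::sleep(Duration::from_secs(6)).await;
        let second = client.describe_cache_clusters().send().await;
        println!("second call succeeded: {}", second.is_ok());
    });
}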
This SDK has gone GA since this issue was opened. Can you update to the latest version and see if this is still breaking for you?
Greetings! It looks like this issue hasn’t been active in longer than a week. We encourage you to check if this is still an issue in the latest release. Because it has been longer than a week since the last update on this, and in the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or add an upvote to prevent automatic closure, or if the issue is already closed, please feel free to open a new one.
Our application, which connects to S3, ran into a similar problem where the S3 connections were stuck indefinitely. I don't have error logs from the SDK to prove the theory, but looking at the sequence of events I am fairly confident the client was stuck on its S3 connections.
I tried reproducing this on my local computer and could indeed see a flurry of DispatchFailure errors:
DispatchFailure { source: ConnectorError { kind: Other(Some(TransientError)), source: hyper::Error(IncompleteMessage), connection: Unknown } }
These errors are tagged as TransientError, and it seems they are not always retried (the problem discussed earlier in this thread).
I can confirm that I am running the latest versions:
aws-config = "*"
aws-sdk-s3 = "*"
Here's a code snippet to repro the issue.
// Imports assumed for this repro; exact paths can differ between SDK versions.
use std::time::Duration;
use aws_config::meta::region::RegionProviderChain;
use aws_sdk_s3::types::SdkError;
use aws_sdk_s3::{Client, Error};

#[tokio::main]
async fn main() -> Result<(), Error> {
let region_provider = RegionProviderChain::default_provider().or_else("us-west-2");
let config = aws_config::from_env().region(region_provider).load().await;
let client = Client::new(&config);
for outer in 0..100 {
// pause between batches to avoid 503s from S3
tokio::time::sleep(Duration::from_secs(5)).await;
let mut s3_key_name: String = "S3ServerAccessLogs/abc_".to_owned();
let s3_key_tag: String = outer.to_string().to_owned();
s3_key_name.push_str(&s3_key_tag);
let mut futs = Vec::new();
// the more concurrent requests, the higher the chance of DispatchFailure errors
for _ in 0..500 {
futs.push(tokio::spawn(
client
.put_object()
.bucket("<my_bucket>")
.key(s3_key_name.clone())
.content_type("application/json")
.send(),
));
}
for fut in futs {
match fut.await {
Ok(res) => match res {
Ok(_res) => {}
Err(SdkError::ServiceError(_err)) => {
// skip logging
}
Err(SdkError::DispatchFailure(err)) => {
if err.is_timeout() || err.is_io() {
println!("DispatchFailure Error is timeout {:?}", err)
} else if err.is_other() {
println!(
"DispatchFailure Error which is retryable failure {:?} {:?}",
s3_key_name, err
);
} else {
println!("DispatchFailure && Unretryable");
}
}
Err(_err) => {
// skip logging
}
},
Err(join_err) => {
println!("Join error {}", join_err)
}
}
}
println!("Object uploaded successfully {} ", s3_key_name);
}
Ok(())
}
ping
I was unable to reproduce this issue on my laptop (M1 Mac) using the code snippet above. Has this been resolved for you as well, @kumargu? If not, it could be that you are hitting OS-imposed resource limits that I am not.
I will try to repro this again. However, I can confirm that the issue still exists for us in production.
We spoke to @Velfi internally, and it was suspected that this could be due to a conflict between the hyper client's default idle timeout (90 seconds) and S3's (20 seconds). To rule that out, we disabled SDK retries, but that hasn't helped either.
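For anyone who wants to rule out the idle-timeout mismatch itself, the relevant knob is hyper's pool_idle_timeout. A minimal hyper-only sketch of lowering it below S3's roughly 20 second server-side timeout is shown below; wiring a custom connector into the SDK client is version-specific and not covered here.

// Sketch: keep idle connections for less time than the server does, so the
// client never tries to reuse a connection the server has already closed.
// Hyper-only illustration; plugging this into the aws-sdk-s3 client requires
// the smithy connector hooks for your SDK version.
use std::time::Duration;
use hyper::client::HttpConnector;
use hyper::Client;

fn build_short_idle_client() -> Client<HttpConnector> {
    Client::builder()
        // S3 reportedly closes idle connections after ~20 seconds; stay under that.
        .pool_idle_timeout(Duration::from_secs(15))
        .build_http()
}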
The bug still happens sporadically in prod. The issue is not gone. Any plans to look into it?
@ssenchenko Can you provide an example that reproduces this issue?
What is the problem?
I wrote a test that polls the DescribeCacheClusters API until an ElastiCache cluster becomes available. With a 10 second interval it fails on exactly the 52nd repetition, but only when run as a test (cargo test); the same code does not fail when executed from a main function (cargo run). I can also avoid the error by using a 5 second interval. The difference seems to be #[tokio::main] vs #[tokio::test]. I also tested the DescribeInstances API with the same code structure, and it did not fail. The aws_smithy_http trace log shows the request being sent but no response coming back. I reproduced this on EC2 with Amazon Linux 2 and with Ubuntu 20.04, in both ap-northeast-1 and us-east-1. I could not find any related issues in tokio or hyper. See the sketch below.
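For illustration, a minimal sketch of the polling pattern described above (not the exact test from the repository linked under Logs; the cluster id is a placeholder, and accessor return types vary slightly between SDK versions, with older releases returning Option<&[CacheCluster]>):

// Sketch of the DescribeCacheClusters polling loop under #[tokio::test].
// Placeholder cluster id; not the original test code.
#[tokio::test]
async fn wait_until_cluster_available() {
    let config = aws_config::from_env().load().await;
    let client = aws_sdk_elasticache::Client::new(&config);

    loop {
        let out = client
            .describe_cache_clusters()
            .cache_cluster_id("test-cluster") // placeholder
            .send()
            .await
            .expect("DescribeCacheClusters failed"); // fails around iteration 52 under cargo test

        let status = out
            .cache_clusters()
            .unwrap_or_default()
            .first()
            .and_then(|c| c.cache_cluster_status());
        if status == Some("available") {
            break;
        }

        // A 10 second interval triggers the failure; 5 seconds does not.
        tokio::time::sleep(std::time::Duration::from_secs(10)).await;
    }
}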
Version
aws-sdk-test v0.1.0 (/home/ec2-user/aws-sdk-test)
├── aws-config v0.6.0
│   ├── aws-http v0.6.0
│   │   ├── aws-smithy-http v0.36.0
│   │   │   ├── aws-smithy-types v0.36.0
│   │   ├── aws-smithy-types v0.36.0 (*)
│   │   ├── aws-types v0.6.0
│   │   │   ├── aws-smithy-async v0.36.0
│   │   │   ├── aws-smithy-types v0.36.0 (*)
│   ├── aws-sdk-sso v0.6.0
│   │   ├── aws-endpoint v0.6.0
│   │   │   ├── aws-smithy-http v0.36.0 (*)
│   │   │   ├── aws-types v0.6.0 (*)
│   │   ├── aws-http v0.6.0 (*)
│   │   ├── aws-sig-auth v0.6.0
│   │   │   ├── aws-sigv4 v0.6.0
│   │   │   │   ├── aws-smithy-http v0.36.0 (*)
│   │   │   ├── aws-smithy-http v0.36.0 (*)
│   │   │   ├── aws-types v0.6.0 (*)
│   │   ├── aws-smithy-async v0.36.0 (*)
│   │   ├── aws-smithy-client v0.36.0
│   │   │   ├── aws-smithy-async v0.36.0 (*)
│   │   │   ├── aws-smithy-http v0.36.0 (*)
│   │   │   ├── aws-smithy-http-tower v0.36.0
│   │   │   │   ├── aws-smithy-http v0.36.0 (*)
│   │   │   ├── aws-smithy-types v0.36.0 (*)
│   │   ├── aws-smithy-http v0.36.0 (*)
│   │   ├── aws-smithy-http-tower v0.36.0 (*)
│   │   ├── aws-smithy-json v0.36.0
│   │   │   └── aws-smithy-types v0.36.0 (*)
│   │   ├── aws-smithy-types v0.36.0 (*)
│   │   ├── aws-types v0.6.0 (*)
│   ├── aws-sdk-sts v0.6.0
│   │   ├── aws-endpoint v0.6.0 (*)
│   │   ├── aws-http v0.6.0 (*)
│   │   ├── aws-sig-auth v0.6.0 (*)
│   │   ├── aws-smithy-async v0.36.0 (*)
│   │   ├── aws-smithy-client v0.36.0 (*)
│   │   ├── aws-smithy-http v0.36.0 (*)
│   │   ├── aws-smithy-http-tower v0.36.0 (*)
│   │   ├── aws-smithy-query v0.36.0
│   │   │   ├── aws-smithy-types v0.36.0 (*)
│   │   ├── aws-smithy-types v0.36.0 (*)
│   │   ├── aws-smithy-xml v0.36.0
│   │   ├── aws-types v0.6.0 (*)
│   ├── aws-smithy-async v0.36.0 (*)
│   ├── aws-smithy-client v0.36.0 (*)
│   ├── aws-smithy-http v0.36.0 (*)
│   ├── aws-smithy-http-tower v0.36.0 (*)
│   ├── aws-smithy-json v0.36.0 (*)
│   ├── aws-smithy-types v0.36.0 (*)
│   ├── aws-types v0.6.0 (*)
├── aws-sdk-ec2 v0.6.0
│   ├── aws-endpoint v0.6.0 (*)
│   ├── aws-http v0.6.0 (*)
│   ├── aws-sig-auth v0.6.0 (*)
│   ├── aws-smithy-async v0.36.0 (*)
│   ├── aws-smithy-client v0.36.0 (*)
│   ├── aws-smithy-http v0.36.0 (*)
│   ├── aws-smithy-http-tower v0.36.0 (*)
│   ├── aws-smithy-query v0.36.0 (*)
│   ├── aws-smithy-types v0.36.0 (*)
│   ├── aws-smithy-xml v0.36.0 (*)
│   ├── aws-types v0.6.0 (*)
├── aws-sdk-elasticache v0.6.0
│   ├── aws-endpoint v0.6.0 (*)
│   ├── aws-http v0.6.0 (*)
│   ├── aws-sig-auth v0.6.0 (*)
│   ├── aws-smithy-async v0.36.0 (*)
│   ├── aws-smithy-client v0.36.0 (*)
│   ├── aws-smithy-http v0.36.0 (*)
│   ├── aws-smithy-http-tower v0.36.0 (*)
│   ├── aws-smithy-query v0.36.0 (*)
│   ├── aws-smithy-types v0.36.0 (*)
│   ├── aws-smithy-xml v0.36.0 (*)
│   ├── aws-types v0.6.0 (*)
Platform
[Amazon Linux 2]: Linux ip-172-31-23-85.ap-northeast-1.compute.internal 5.10.82-83.359.amzn2.x86_64 #1 SMP Tue Nov 30 20:47:14 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
[Ubuntu 20.04]: Linux ip-172-31-19-97 5.11.0-1022-aws #23~20.04.1-Ubuntu SMP Mon Nov 15 14:03:19 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
AWS Services
ElastiCache
Description
No response
Logs
Using this: https://github.com/yujinis/aws-sdk-test
Reproduced:
Not reproduced: