awslabs / aws-sdk-rust

AWS SDK for the Rust Programming Language
https://awslabs.github.io/aws-sdk-rust/
Apache License 2.0
2.91k stars 245 forks source link

Credentials timeout errors are deceptive #1118

Open elrob opened 3 months ago

elrob commented 3 months ago

Describe the bug

Today, when testing I've been seeing this error:

DispatchFailure(DispatchFailure { source: ConnectorError { kind: Other(None), source: TimedOutError(5s), connection: Unknown } })

This is despite the fact that I have this configuration set:

    let aws_sdk_config = aws_config::defaults(BehaviorVersion::latest())
        .retry_config(RetryConfig::standard().with_max_attempts(20))
        .load()
        .await;

Expected Behavior

I assume that this error is transient so should be resolved with retries even if my local network is unstable? Or based on my RetryConfig, it should have retried 20 times but I assume it didn't based on how quickly the error appears.

Current Behavior

Occasional errors like this which seem to happen immediately rather than retrying up to 20 times.

DispatchFailure(DispatchFailure { source: ConnectorError { kind: Other(None), source: TimedOutError(5s), connection: Unknown } })

Reproduction Steps

Maybe just configure a client as above and run a command with no network connection. The error appears quickly and retries don't seem to be obeyed.

Possible Solution

No response

Additional Information/Context

No response

Version

├── aws-config v1.1.9
│   ├── aws-credential-types v1.1.8
│   │   ├── aws-smithy-async v1.2.0
│   │   ├── aws-smithy-runtime-api v1.2.0
│   │   │   ├── aws-smithy-async v1.2.0 (*)
│   │   │   ├── aws-smithy-types v1.1.8
│   │   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-runtime v1.1.8
│   │   ├── aws-credential-types v1.1.8 (*)
│   │   ├── aws-sigv4 v1.2.0
│   │   │   ├── aws-credential-types v1.1.8 (*)
│   │   │   ├── aws-smithy-http v0.60.7
│   │   │   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-async v1.2.0 (*)
│   │   ├── aws-smithy-http v0.60.7 (*)
│   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-types v1.1.8
│   │   │   ├── aws-credential-types v1.1.8 (*)
│   │   │   ├── aws-smithy-async v1.2.0 (*)
│   │   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   │   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-sdk-sso v1.18.0
│   │   ├── aws-credential-types v1.1.8 (*)
│   │   ├── aws-runtime v1.1.8 (*)
│   │   ├── aws-smithy-async v1.2.0 (*)
│   │   ├── aws-smithy-http v0.60.7 (*)
│   │   ├── aws-smithy-json v0.60.7
│   │   │   └── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-runtime v1.1.8
│   │   │   ├── aws-smithy-async v1.2.0 (*)
│   │   │   ├── aws-smithy-http v0.60.7 (*)
│   │   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-types v1.1.8 (*)
│   ├── aws-sdk-ssooidc v1.18.0
│   │   ├── aws-credential-types v1.1.8 (*)
│   │   ├── aws-runtime v1.1.8 (*)
│   │   ├── aws-smithy-async v1.2.0 (*)
│   │   ├── aws-smithy-http v0.60.7 (*)
│   │   ├── aws-smithy-json v0.60.7 (*)
│   │   ├── aws-smithy-runtime v1.1.8 (*)
│   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-types v1.1.8 (*)
│   ├── aws-sdk-sts v1.18.0
│   │   ├── aws-credential-types v1.1.8 (*)
│   │   ├── aws-runtime v1.1.8 (*)
│   │   ├── aws-smithy-async v1.2.0 (*)
│   │   ├── aws-smithy-http v0.60.7 (*)
│   │   ├── aws-smithy-json v0.60.7 (*)
│   │   ├── aws-smithy-query v0.60.7
│   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-runtime v1.1.8 (*)
│   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-xml v0.60.7
│   │   ├── aws-types v1.1.8 (*)
│   ├── aws-smithy-async v1.2.0 (*)
│   ├── aws-smithy-http v0.60.7 (*)
│   ├── aws-smithy-json v0.60.7 (*)
│   ├── aws-smithy-runtime v1.1.8 (*)
│   ├── aws-smithy-runtime-api v1.2.0 (*)
│   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-types v1.1.8 (*)
├── aws-sdk-elasticloadbalancingv2 v1.19.0
│   ├── aws-credential-types v1.1.8 (*)
│   ├── aws-runtime v1.1.8 (*)
│   ├── aws-smithy-async v1.2.0 (*)
│   ├── aws-smithy-http v0.60.7 (*)
│   ├── aws-smithy-json v0.60.7 (*)
│   ├── aws-smithy-query v0.60.7 (*)
│   ├── aws-smithy-runtime v1.1.8 (*)
│   ├── aws-smithy-runtime-api v1.2.0 (*)
│   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-smithy-xml v0.60.7 (*)
│   ├── aws-types v1.1.8 (*)


### Environment details (OS name and version, etc.)

macOS 13.6.3 (22G436)

### Logs

_No response_
ysaito1001 commented 3 months ago

Hi @elrob, looks like a timeout of 5 seconds kicked in before you observed the desired number of retries. Can you try disabling timeout (using TimeoutConfig::disabled()) and pass it to ConfigLoader::timeout_config to see if the behavior matches your expectation?

rcoh commented 2 months ago

This is an interesting one! There is a clue here as to what's going on:

TimedOutError(5s)

the TimedOutError in the connector doesn't have any fields—this is actually the timeout coming from identity cache!

That's the actual problem here! We're trying to load credentials and it's timing out after 5 seconds (this is a pretty sensible timeout for credentials, if it's taking longer than 5 seconds to load credentials, there are other problems!)

The error is a little misleading—it looks like a connector error because we never sent the request (we should fix that), but the timeout here is coming from the credentials chain. That's why it's not retried.

cc @ysaito1001 this appears to be the problematic line: https://github.com/smithy-lang/smithy-rs/blob/e3f0de42db727d5419948b03c7d5a3773b07e34b/rust-runtime/aws-smithy-runtime-api/src/client/orchestrator.rs#L273

We are incorrectly assuming that any dispatch failure is a connector error which isn't the case. The error here was actually a failure to load credentials. We need to figure out how to fix the error type here.

elrob commented 2 months ago

@ysaito1001 @rcoh Thank you! Yeah, I also thought it might be coming from the credentials provider. There is some annoying flakey ~malware~ security software installed on my machine which often causes temporary connection issues.

My assumption remains the same though, I'd expect the credentials provider to retry after a timeout up to the configured number of retries. Or if it isn't the credentials provider, I'd expect the client to retry after a timeout up to the configured number of retries. Perhaps the credentials provider STS client needs a separately configured retry configuration? I actually looked for that but didn't work out how I could configure that.

ysaito1001 commented 2 months ago

Good eyes @rcoh! True, that's definitely something we want to clarify the error type for.

I'd expect the credentials provider to retry after a timeout up to the configured number of retries.

That depends on the underlying retry classifiers (those that implement the ClassifyRetry trait). For instance, if you do not get a response from STS and fail to load credentials, then I don't think it gets retried regardless of the number of retries configured. Can you turn on tracing_subscriber e.g. tracing_subscriber::fmt::init() and see whether an retry occurs or does not retry at all before you error out?

rcoh commented 2 months ago

Just wanted to clarify some things here. The timeout is on the top level of loading credentials (at the cache level). It is set to 5 seconds (this is configurable), but we don't really recommend messing with it.

Re: ClassifyRetry: Credentials retries do exist, but they're configured specifically for each provider and don't operate the same way as general-purpose retries (nor are they configurable in the same way). They will show up in the logs, however.

This is because credentials taking longer than 5 seconds to load almost always is indicative of other problems or anti patterns.

For example, this can occur if you are simultaneously creating a large number of clients or aws_config instances. There is currently an open issue around credentials being cached per client instead of on the root SdkConfig (but you still should not be creating a large number of clients!)

You can configure this timeout to be longer, but the timeout is almost certainly not the problem here. I would dive deeper into where your credentials are coming from, how often they're being refreshed, etc. — it's also not out of the question that you're triggering an SDK bug around credentials caching!

elrob commented 2 months ago

Thanks. I haven't seen this issue appear in EKS but it appeared a few times for me when testing something out on my machine. I can live with it assuming it doesn't appear on EKS in production.

elrob commented 2 months ago

I don't seem to see any retries in the logs even with RUST_LOG=trace. This example happens locally on my machine if I disable my network connection (turn off wifi). Whether or not this should be retryable is something I'll leave to you to decide.

{"timestamp":"2024-04-04T11:42:33.950270Z","level":"INFO","fields":{"message":"first credentials will be loaded from Sso { sso_session_name: None, sso_region: \"us-east-1\", sso_start_url: \"REDACTED\", sso_account_id: Some(\"REDACTED\"), sso_role_name: Some(\"REDACTED\") }","base":"Sso { sso_session_name: None, sso_region: \"us-east-1\", sso_start_url: \"REDACTED\", sso_account_id: Some(\"REDACTED\"), sso_role_name: Some(\"REDACTED\") }"},"target":"aws_config::profile::credentials::exec","span":{"name":"lazy_load_identity"},"spans":[{"name":"lazy_load_identity"}]}
{"timestamp":"2024-04-04T11:42:33.950301Z","level":"INFO","fields":{"message":"which will be used to assume a role","role_arn":"RoleArn { role_arn: \"arn:aws:iam::REDACTED\", external_id: None, session_name: Some(\"sso\") }"},"target":"aws_config::profile::credentials::exec","span":{"name":"lazy_load_identity"},"spans":[{"name":"lazy_load_identity"}]}
{"timestamp":"2024-04-04T11:42:35.227689Z","level":"WARN","fields":{"message":"failed to load base credentials","error":"an error occurred while loading credentials: dispatch failure: io error: error trying to connect: dns error: failed to lookup address information: nodename nor servname provided, or not known: dns error: failed to lookup address information: nodename nor servname provided, or not known: failed to lookup address information: nodename nor servname provided, or not known (ProviderError(ProviderError { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Io, source: hyper::Error(Connect, ConnectError(\"dns error\", Custom { kind: Uncategorized, error: \"failed to lookup address information: nodename nor servname provided, or not known\" })), connection: Unknown } }) }))"},"target":"aws_config::profile::credentials","span":{"name":"lazy_load_identity"},"spans":[{"name":"lazy_load_identity"}]}
{"timestamp":"2024-04-04T11:42:35.228331Z","level":"WARN","fields":{"message":"provider failed to provide credentials","provider":"Profile","error":"an error occurred while loading credentials: an error occurred while loading credentials: dispatch failure: io error: error trying to connect: dns error: failed to lookup address information: nodename nor servname provided, or not known: dns error: failed to lookup address information: nodename nor servname provided, or not known: failed to lookup address information: nodename nor servname provided, or not known (ProviderError(ProviderError { source: ProviderError(ProviderError { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Io, source: hyper::Error(Connect, ConnectError(\"dns error\", Custom { kind: Uncategorized, error: \"failed to lookup address information: nodename nor servname provided, or not known\" })), connection: Unknown } }) }) }))"},"target":"aws_config::meta::credentials::chain","span":{"name":"lazy_load_identity"},"spans":[{"name":"lazy_load_identity"}]}
jdisanti commented 2 months ago

Leaving this open to make the error message less confusing when credentials providers time out (also tracking in https://github.com/smithy-lang/smithy-rs/issues/2950).