elrob opened this issue 3 months ago
Hi @elrob, it looks like a timeout of 5 seconds kicked in before you observed the desired number of retries. Can you try disabling timeouts by passing `TimeoutConfig::disabled()` to `ConfigLoader::timeout_config` and see if the behavior matches your expectation?
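Something along these lines should do it (a minimal sketch; the S3 client is just an illustrative stand-in for whichever service client you are using):

```rust
use aws_config::timeout::TimeoutConfig;
use aws_config::BehaviorVersion;

#[tokio::main]
async fn main() {
    // Disable all client-level timeouts so that only the retry
    // behavior is observable.
    let config = aws_config::defaults(BehaviorVersion::latest())
        .timeout_config(TimeoutConfig::disabled())
        .load()
        .await;

    // Illustrative client; substitute the service you are actually calling.
    let _client = aws_sdk_s3::Client::new(&config);
}
```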
This is an interesting one! There is a clue here as to what's going on:
TimedOutError(5s)
The `TimedOutError` in the connector doesn't have any fields, so this is actually the timeout coming from the identity cache!
That's the actual problem here! We're trying to load credentials and it's timing out after 5 seconds (this is a pretty sensible timeout for credentials; if it's taking longer than 5 seconds to load credentials, there are other problems!).
The error is a little misleading: it looks like a connector error because we never sent the request (we should fix that), but the timeout is actually coming from the credentials chain. That's why it's not retried.
cc @ysaito1001 this appears to be the problematic line: https://github.com/smithy-lang/smithy-rs/blob/e3f0de42db727d5419948b03c7d5a3773b07e34b/rust-runtime/aws-smithy-runtime-api/src/client/orchestrator.rs#L273
We are incorrectly assuming that any dispatch failure is a connector error, which isn't the case; the error here was actually a failure to load credentials. We need to figure out how to fix the error type here.
@ysaito1001 @rcoh Thank you! Yeah, I also thought it might be coming from the credentials provider. There is some annoying, flaky ~~malware~~ security software installed on my machine which often causes temporary connection issues.
My assumption remains the same, though: I'd expect the credentials provider to retry after a timeout, up to the configured number of retries. Or, if it isn't the credentials provider, I'd expect the client to retry after a timeout, up to the configured number of retries. Perhaps the credentials provider's STS client needs a separately configured retry configuration? I looked for that but couldn't work out how to configure it.
Good eyes @rcoh! True, that error type is definitely something we want to clarify.
> I'd expect the credentials provider to retry after a timeout up to the configured number of retries.
That depends on the underlying retry classifiers (those that implement the `ClassifyRetry` trait). For instance, if you do not get a response from STS and fail to load credentials, then I don't think it gets retried regardless of the number of retries configured. Can you turn on tracing_subscriber (e.g. `tracing_subscriber::fmt::init()`) and see whether any retry occurs at all before you error out?
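For reference, enabling that looks like this (a minimal sketch; `fmt::init()` picks up its filter from the `RUST_LOG` environment variable):

```rust
fn main() {
    // Install a global subscriber that prints tracing events to stdout.
    // Filtering is controlled by RUST_LOG, e.g.:
    //   RUST_LOG=aws_config=trace cargo run
    tracing_subscriber::fmt::init();

    // ... build the async runtime and AWS clients after this point ...
}
```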
Just wanted to clarify some things here. The timeout is on the top level of loading credentials (at the cache level). It is set to 5 seconds (this is configurable), but we don't really recommend messing with it.
Re: `ClassifyRetry`: Credentials retries do exist, but they're configured specifically for each provider and don't operate the same way as general-purpose retries (nor are they configurable in the same way). They will show up in the logs, however.
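For context, a general-purpose retry classifier is just an implementation of that trait. Here is a minimal sketch (the struct name and the blanket treat-every-error-as-transient policy are illustrative assumptions, not how the SDK's built-in classifiers behave):

```rust
use aws_smithy_runtime_api::client::interceptors::context::InterceptorContext;
use aws_smithy_runtime_api::client::retries::classifiers::{ClassifyRetry, RetryAction};

// Hypothetical classifier that marks every error as transient (retryable).
#[derive(Debug)]
struct RetryEverythingClassifier;

impl ClassifyRetry for RetryEverythingClassifier {
    fn classify_retry(&self, ctx: &InterceptorContext) -> RetryAction {
        match ctx.output_or_error() {
            // Real classifiers inspect the error kind and the response;
            // this sketch retries on any error at all.
            Some(Err(_)) => RetryAction::transient_error(),
            _ => RetryAction::NoActionIndicated,
        }
    }

    fn name(&self) -> &'static str {
        "RetryEverythingClassifier"
    }
}
```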
This is because credentials taking longer than 5 seconds to load is almost always indicative of other problems or anti-patterns.
For example, this can occur if you are simultaneously creating a large number of clients or `aws_config` instances. There is currently an open issue around credentials being cached per client instead of on the root `SdkConfig` (but you still should not be creating a large number of clients!).
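To illustrate the recommended pattern (the specific service clients here are arbitrary examples): load configuration once and hand the same `SdkConfig` to every client, rather than building a fresh `aws_config` per client.

```rust
use aws_config::BehaviorVersion;

#[tokio::main]
async fn main() {
    // Load configuration exactly once...
    let shared_config = aws_config::defaults(BehaviorVersion::latest())
        .load()
        .await;

    // ...then construct as many clients as needed from the same SdkConfig.
    // (Credentials are currently cached per client, per the open issue
    // mentioned above, but sharing one SdkConfig is still the pattern
    // to follow.)
    let _s3 = aws_sdk_s3::Client::new(&shared_config);
    let _sts = aws_sdk_sts::Client::new(&shared_config);
}
```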
You can configure this timeout to be longer, but the timeout is almost certainly not the problem here. I would dive deeper into where your credentials are coming from, how often they're being refreshed, etc. — it's also not out of the question that you're triggering an SDK bug around credentials caching!
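For completeness, raising that load timeout looks roughly like this (a sketch; it assumes the lazy `IdentityCache` builder from `aws-smithy-runtime` and the `identity_cache` hook on the config loader):

```rust
use std::time::Duration;

use aws_config::BehaviorVersion;
use aws_smithy_runtime::client::identity::IdentityCache;

#[tokio::main]
async fn main() {
    let _config = aws_config::defaults(BehaviorVersion::latest())
        // Allow credential loading to take up to 10 seconds instead of
        // the default 5. As noted above, this is usually a workaround
        // rather than a fix.
        .identity_cache(
            IdentityCache::lazy()
                .load_timeout(Duration::from_secs(10))
                .build(),
        )
        .load()
        .await;
}
```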
Thanks. I haven't seen this issue in EKS, but it came up a few times while I was testing something out on my machine. I can live with it, assuming it doesn't appear on EKS in production.
I don't see any retries in the logs, even with `RUST_LOG=trace`.
This example happens locally on my machine if I disable my network connection (turn off wifi).
Whether or not this should be retryable is something I'll leave to you to decide.
{"timestamp":"2024-04-04T11:42:33.950270Z","level":"INFO","fields":{"message":"first credentials will be loaded from Sso { sso_session_name: None, sso_region: \"us-east-1\", sso_start_url: \"REDACTED\", sso_account_id: Some(\"REDACTED\"), sso_role_name: Some(\"REDACTED\") }","base":"Sso { sso_session_name: None, sso_region: \"us-east-1\", sso_start_url: \"REDACTED\", sso_account_id: Some(\"REDACTED\"), sso_role_name: Some(\"REDACTED\") }"},"target":"aws_config::profile::credentials::exec","span":{"name":"lazy_load_identity"},"spans":[{"name":"lazy_load_identity"}]}
{"timestamp":"2024-04-04T11:42:33.950301Z","level":"INFO","fields":{"message":"which will be used to assume a role","role_arn":"RoleArn { role_arn: \"arn:aws:iam::REDACTED\", external_id: None, session_name: Some(\"sso\") }"},"target":"aws_config::profile::credentials::exec","span":{"name":"lazy_load_identity"},"spans":[{"name":"lazy_load_identity"}]}
{"timestamp":"2024-04-04T11:42:35.227689Z","level":"WARN","fields":{"message":"failed to load base credentials","error":"an error occurred while loading credentials: dispatch failure: io error: error trying to connect: dns error: failed to lookup address information: nodename nor servname provided, or not known: dns error: failed to lookup address information: nodename nor servname provided, or not known: failed to lookup address information: nodename nor servname provided, or not known (ProviderError(ProviderError { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Io, source: hyper::Error(Connect, ConnectError(\"dns error\", Custom { kind: Uncategorized, error: \"failed to lookup address information: nodename nor servname provided, or not known\" })), connection: Unknown } }) }))"},"target":"aws_config::profile::credentials","span":{"name":"lazy_load_identity"},"spans":[{"name":"lazy_load_identity"}]}
{"timestamp":"2024-04-04T11:42:35.228331Z","level":"WARN","fields":{"message":"provider failed to provide credentials","provider":"Profile","error":"an error occurred while loading credentials: an error occurred while loading credentials: dispatch failure: io error: error trying to connect: dns error: failed to lookup address information: nodename nor servname provided, or not known: dns error: failed to lookup address information: nodename nor servname provided, or not known: failed to lookup address information: nodename nor servname provided, or not known (ProviderError(ProviderError { source: ProviderError(ProviderError { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Io, source: hyper::Error(Connect, ConnectError(\"dns error\", Custom { kind: Uncategorized, error: \"failed to lookup address information: nodename nor servname provided, or not known\" })), connection: Unknown } }) }) }))"},"target":"aws_config::meta::credentials::chain","span":{"name":"lazy_load_identity"},"spans":[{"name":"lazy_load_identity"}]}
Leaving this open to make the error message less confusing when credentials providers time out (also tracking in https://github.com/smithy-lang/smithy-rs/issues/2950).
Describe the bug
Today, when testing, I've been seeing this error:
This is despite the fact that I have this configuration set:
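A configuration along these lines (a hypothetical sketch reconstructed from the 20-attempt retry policy described under Expected Behavior; the exact original values are assumptions):

```rust
use aws_config::retry::RetryConfig;
use aws_config::BehaviorVersion;

#[tokio::main]
async fn main() {
    let _config = aws_config::defaults(BehaviorVersion::latest())
        // Hypothetical reconstruction: up to 20 attempts, matching the
        // retry count mentioned in this report.
        .retry_config(RetryConfig::standard().with_max_attempts(20))
        .load()
        .await;
}
```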
Expected Behavior
I assume this error is transient, so it should be resolved by retries even if my local network is unstable. Based on my RetryConfig, it should have retried 20 times, but I assume it didn't, given how quickly the error appears.
Current Behavior
Occasional errors like this, which appear immediately rather than after retrying up to 20 times.
Reproduction Steps
Maybe just configure a client as above and run a command with no network connection. The error appears quickly, and the retry configuration doesn't seem to be honored.
Possible Solution
No response
Additional Information/Context
No response
Version