aws / amazon-ssm-agent

An agent to enable remote management of your EC2 instances, on-premises servers, or virtual machines (VMs).
https://aws.amazon.com/systems-manager/
Apache License 2.0

Appears to be no upper limit to exponential backoff when agent connecting to amazonaws.com #468

Closed · damonmaria closed this issue 1 year ago

damonmaria commented 2 years ago

We lost the Internet connection on our edge servers. When it was restored, the hybrid SSM agent didn't reconnect. Looking through the logs, it seems to be using exponential backoff but ends up waiting for unreasonably long periods.

Can there be a maximum retry wait of 5 minutes, or something a bit more sane?

2022-09-18 03:41:49 INFO [CredentialRefresher] Sleeping for 2h31m32s before retrying retrieve credentials
2022-09-18 06:14:50 ERROR [CredentialRefresher] Retrieve credentials produced aws error: RequestError: send request failed
caused by: Post "https://ssm.us-east-1.amazonaws.com/": dial tcp: lookup ssm.us-east-1.amazonaws.com on 127.0.0.53:53: read udp 127.0.0.1:36079->127.0.0.53:53: i/o timeout
2022-09-18 06:14:50 INFO [CredentialRefresher] Sleeping for 5h3m12s before retrying retrieve credentials
2022-09-18 11:19:31 ERROR [CredentialRefresher] Retrieve credentials produced aws error: RequestError: send request failed
caused by: Post "https://ssm.us-east-1.amazonaws.com/": dial tcp: lookup ssm.us-east-1.amazonaws.com on 127.0.0.53:53: read udp 127.0.0.1:46366->127.0.0.53:53: i/o timeout
2022-09-18 11:19:31 INFO [CredentialRefresher] Sleeping for 10h17m32s before retrying retrieve credentials
2022-09-18 21:38:32 ERROR [CredentialRefresher] Retrieve credentials produced aws error: RequestError: send request failed
caused by: Post "https://ssm.us-east-1.amazonaws.com/": dial tcp: lookup ssm.us-east-1.amazonaws.com on 127.0.0.53:53: read udp 127.0.0.1:57151->127.0.0.53:53: i/o timeout
2022-09-18 21:38:32 INFO [CredentialRefresher] Sleeping for 21h1m51s before retrying retrieve credentials
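
For what it's worth, the cap I'm asking for is trivial to sketch. The following is a minimal shell illustration, not the agent's actual code: retrieve_credentials is a hypothetical stand-in for the agent's credential call, and the 30-second initial wait is an assumption.

wait_secs=30                      # assumed initial wait
max_wait=300                      # the 5-minute cap proposed above
until retrieve_credentials; do    # hypothetical stand-in for the real credential call
    echo "Sleeping for ${wait_secs}s before retrying retrieve credentials"
    sleep "$wait_secs"
    wait_secs=$(( wait_secs * 2 ))
    [ "$wait_secs" -gt "$max_wait" ] && wait_secs=$max_wait
done
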
robert-kapala-mox commented 2 years ago

We faced the same issue, and we have confirmation from AWS support that the exponential backoff is by design. The ability to control this behaviour would be much appreciated.

damonmaria commented 2 years ago

This is constantly catching us out, as our servers operate at the edge with intermittent Internet access. Surely the fix can't be hard.

timharris777 commented 1 year ago

We need this to be configurable as well. Here are other issues for the same thing:

#479

#491

damonmaria commented 1 year ago

We've gone to the extent of having a process poll the SSM log, and if the last line contains 'Sleeping', it restarts the ssm service.

tdekoning93 commented 1 year ago

> We've gone to the extent of having a process poll the SSM log, and if the last line contains 'Sleeping', it restarts the ssm service.

Can you maybe provide the solution you built to circumvent this?

It's really stupid that it's set up this way by default, lol.

damonmaria commented 1 year ago

> Can you maybe provide the solution you built to circumvent this?

We do this using Puppet, so it's unlikely to apply directly to your situation. But here's the check we run to determine whether we need to restart the service:

/bin/journalctl -b -u amazon-ssm-agent.service | /bin/tail -n1 | /bin/grep Sleeping
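
For anyone wanting a standalone version of this workaround rather than our Puppet code, a rough sketch follows, assuming a systemd host; the script name and scheduling are up to you. Run it from cron every few minutes.

#!/bin/sh
# Restart the agent if the last log line shows it sleeping between
# credential retries (same check as above).
if /bin/journalctl -b -u amazon-ssm-agent.service | /bin/tail -n1 | /bin/grep -q Sleeping; then
    /bin/systemctl restart amazon-ssm-agent.service
fi

Scheduled every five minutes, this bounds the worst-case reconnect delay at roughly the cron interval, regardless of how long the agent's own backoff has grown.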

timharris777 commented 1 year ago

Just received confirmation from the AWS team that they are working on this with a target release of late Q3/early Q4.

sluggard76 commented 1 year ago

We have created a feature request accordingly. Please note that we have a backlog of feature requests. We'll prioritize and work on those requests as they come in.

doubleopinter commented 1 year ago

+1 for this feature.