aws / amazon-ssm-agent

An agent to enable remote management of your EC2 instances, on-premises servers, or virtual machines (VMs).
https://aws.amazon.com/systems-manager/
Apache License 2.0

amazon-ssm-agent dies on reboot #306

Closed jacktar283 closed 3 years ago

jacktar283 commented 4 years ago

I can create a CentOS EC2 instance and install the latest SSM agent from the s3 pool:

Install AWS SSM agent on Centos

However, when I reboot the instance, I cannot log in via SSM as required. (If I start the agent from the command line, I can access the instance as expected.) Accessing the instance through a jump host and a local SSH connection, I see the following systemctl output:

$ systemctl status amazon-ssm-agent
● amazon-ssm-agent.service - amazon-ssm-agent
   Loaded: loaded (/etc/systemd/system/amazon-ssm-agent.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Tue 2020-09-08 12:24:07 UTC; 44s ago
  Process: 1247 ExecStart=/usr/bin/amazon-ssm-agent (code=killed, signal=HUP)
 Main PID: 1247 (code=killed, signal=HUP)

I can start the agent manually and then log in, but the problem recurs on every reboot: the agent is killed with SIGHUP for no apparent reason. There are no useful messages in the log files, even with logging enabled at DEBUG level.

I have tried running the process under strace to identify how and when it fails, but under strace it doesn't fail. That is, I changed the ExecStart line in /etc/systemd/system/amazon-ssm-agent.service from:

ExecStart=/usr/bin/amazon-ssm-agent

to:

ExecStart=/usr/bin/strace -o /var/log/amazon/ssm/strace.out -t /usr/bin/amazon-ssm-agent

This results in the process NOT getting killed over a reboot, so the traces don't show anything either.
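For anyone repeating this experiment, the same ExecStart change can be made with a systemd drop-in instead of editing the packaged unit file, which makes it easier to revert. The following is only a sketch: the helper name `write_strace_dropin` and the `override.conf` file name are illustrative choices, and the drop-in directory in the usage comment assumes the unit lives at the path shown in the status output above.

```shell
#!/bin/sh
# Hypothetical helper: write a drop-in that wraps the agent in strace.
# $1 is the drop-in directory for the unit.
write_strace_dropin() {
    dropin_dir="$1"
    mkdir -p "$dropin_dir"
    cat > "$dropin_dir/override.conf" <<'EOF'
[Service]
# An empty ExecStart= clears the packaged command before a new one is set.
ExecStart=
ExecStart=/usr/bin/strace -o /var/log/amazon/ssm/strace.out -t /usr/bin/amazon-ssm-agent
EOF
}

# On the affected host (not run here; requires root, path is an assumption):
#   write_strace_dropin /etc/systemd/system/amazon-ssm-agent.service.d
#   systemctl daemon-reload
```

Deleting the drop-in directory and running `systemctl daemon-reload` again restores the original ExecStart.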

Does amazon-ssm-agent have any ulimit constraints that are documented anywhere? Are those limits being honoured by systemd? I'm clutching at straws, but once this issue arises it is reproducible. Please advise which logs are needed in support of this issue.

I'm running a CentOS 7 t2.medium instance with agent 2.3.1644.0.
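One way to check the limits question directly is to compare what the kernel actually applies to the running agent with what systemd has configured for the unit. This is a rough sketch rather than a documented procedure: `show_limits` is a hypothetical helper name, and the usage lines assume the agent has been started manually so that it has a live PID.

```shell
#!/bin/sh
# Hypothetical helper: print the resource limits the kernel applies to a process.
# $1 is a PID; /proc/<pid>/limits is a standard Linux procfs file.
show_limits() {
    cat "/proc/$1/limits"
}

# On the affected host (not run here):
#   pid=$(systemctl show -p MainPID amazon-ssm-agent | cut -d= -f2)
#   show_limits "$pid"
#   systemctl show -p LimitNOFILE -p LimitNPROC amazon-ssm-agent
```

If the values reported under /proc differ from the unit's Limit* properties, that would suggest the limits are not being applied as configured.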

aguman-aws commented 4 years ago

What is the output of systemctl status amazon-ssm-agent on first install? Are you able to log in via SSM after first install without starting the agent manually?

Can you provide systemd configuration as well?

Juberstine commented 4 years ago

We are having the same / similar issue on Amazon Linux 2 as well.

# sudo systemctl status amazon-ssm-agent
● amazon-ssm-agent.service - amazon-ssm-agent
   Loaded: loaded (/etc/systemd/system/amazon-ssm-agent.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Fri 2020-09-11 18:53:20 UTC; 3 days ago
 Main PID: 1970 (code=killed, signal=HUP)

Sep 11 18:53:19 REDACTED amazon-ssm-agent[1970]: 2020-09-11 18:53:19 INFO [MessagingDeliveryService] Starting document processing engine...
Sep 11 18:53:19 REDACTED amazon-ssm-agent[1970]: 2020-09-11 18:53:19 INFO [MessagingDeliveryService] [EngineProcessor] Starting
Sep 11 18:53:19 REDACTED amazon-ssm-agent[1970]: 2020-09-11 18:53:19 INFO [MessagingDeliveryService] [EngineProcessor] Initial processing
Sep 11 18:53:19 REDACTED amazon-ssm-agent[1970]: 2020-09-11 18:53:19 INFO [MessagingDeliveryService] Starting message polling
Sep 11 18:53:19 REDACTED amazon-ssm-agent[1970]: 2020-09-11 18:53:19 INFO [MessagingDeliveryService] Starting send replies to MDS
Sep 11 18:53:19 REDACTED amazon-ssm-agent[1970]: 2020-09-11 18:53:19 INFO [instanceID=i-0c3512187e27c0b8b] Starting association polling
Sep 11 18:53:19 REDACTED amazon-ssm-agent[1970]: 2020-09-11 18:53:19 INFO [MessagingDeliveryService] [Association] [EngineProcessor] Starting
Sep 11 18:53:19 REDACTED amazon-ssm-agent[1970]: 2020-09-11 18:53:19 INFO [MessagingDeliveryService] [Association] Launching response handler
Sep 11 18:53:20 REDACTED amazon-ssm-agent[1970]: 2020-09-11 18:53:19 INFO [MessagingDeliveryService] [Association] [EngineProcessor] Initial processing
Sep 11 18:53:20 REDACTED amazon-ssm-agent[1970]: 2020-09-11 18:53:19 INFO [MessagingDeliveryService] [Association] Initializing association scheduling service

jacktar283 commented 4 years ago

What is the output of systemctl status amazon-ssm-agent on first install? Are you able to log in via SSM after first install without starting the agent manually?

As noted in the original post, the agent is already dead by the time I'm able to log in to the instance, so systemctl status amazon-ssm-agent shows:

$ sudo systemctl status amazon-ssm-agent
● amazon-ssm-agent.service - amazon-ssm-agent
   Loaded: loaded (/etc/systemd/system/amazon-ssm-agent.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Tue 2020-09-08 12:24:07 UTC; 44s ago
  Process: 1247 ExecStart=/usr/bin/amazon-ssm-agent (code=killed, signal=HUP)
 Main PID: 1247 (code=killed, signal=HUP)

Can you provide systemd configuration as well?

The systemd configuration is as provided by the RPM, except that ExecStart has been modified to add strace, as described above, to try to work around and troubleshoot the issue.

Juberstine commented 4 years ago

@jacktar283 SSM Agent V3 is out now. We are going to test if this fixes the issue for us.

Juberstine commented 4 years ago

@jacktar283 My early testing shows this issue may be resolved with V3 on Amazon Linux 2 in cases where we had this issue. I'd say give it a try.

jacktar283 commented 4 years ago

@jacktar283 My early testing shows this issue may be resolved with V3 on Amazon Linux 2 in cases where we had this issue. I'd say give it a try.

Thanks. I haven't had the chance to take a look yet, but it sounds promising.

jacktar283 commented 4 years ago

@Juberstine - I can confirm that it works for me too on CentOS instances. I saw the v2 agent die after a reboot, but once the agent updated to v3 the problem disappeared and the agent works as expected. I think this can be closed.