aws / amazon-ssm-agent

An agent to enable remote management of your EC2 instances, on-premises servers, or virtual machines (VMs).
https://aws.amazon.com/systems-manager/
Apache License 2.0

ssm agent failing to start after reboot #320

Open noelmcgrath opened 3 years ago

noelmcgrath commented 3 years ago

We are applying patches to our Windows instances using the Patch Manager function in AWS Systems Manager. We have a patch baseline that is executed against a set of Windows instances (each of which is part of a patch group) by a maintenance window, which in turn executes a run command (AWS-RunPatchBaseline) against each of the instances. However, we are finding the following:

- The instances in question seem to get patches installed correctly; executing wmic qfe list shows that the patches have been installed on the target machines.
- The target instances are then rebooted after the patches are installed.
- The run command remains in progress indefinitely.

On further investigation we found that the amazon-ssm-agent failed to start when the instances were rebooted. Looking at the event logs shows a timeout occurred:

Get-WinEvent -ProviderName 'Service Control Manager'

Output:

09/11/2020 14:25:56           7000 Error            The AmazonSSMAgent service failed to start due to the following error: …
09/11/2020 14:25:56           7009 Error            A timeout was reached (30000 milliseconds) while waiting for the AmazonSSMAgent service to connect.

Once we manually restarted the amazon-ssm-agent, the run command completed successfully. The issue is that we do not want to have to manually start the amazon-ssm-agent on each instance, especially as we have a lot of instances. The fact that a manual restart fixes it suggests it is not an issue with persistent routes either, and I have just double-checked:

Instance IP 10.1.3.217

Persistent Routes:

Network Address          Netmask  Gateway Address  Metric
  169.254.169.254  255.255.255.255         10.1.3.1      15
  169.254.169.250  255.255.255.255         10.1.3.1      15
  169.254.169.251  255.255.255.255         10.1.3.1      15
  169.254.169.249  255.255.255.255         10.1.3.1      15
  169.254.169.123  255.255.255.255         10.1.3.1      15
  169.254.169.253  255.255.255.255         10.1.3.1      15

Any ideas on what is causing this, i.e. why the amazon-ssm-agent is not starting up successfully after the automatic reboot?

VishnuKarthikRavindran commented 3 years ago

Thanks for reaching out. Can you please open a support ticket in the AWS Console and attach the full errors.log and amazon-ssm-agent.log to assist us in finding the root cause?

https://docs.aws.amazon.com/systems-manager/latest/userguide/sysman-agent-logs.html
https://docs.aws.amazon.com/awssupport/latest/user/case-management.html#creating-a-support-case

noelmcgrath commented 3 years ago

I can't open a ticket. Attached are the files. Basically, we started an SSM maintenance window for patching on 09/11 at 14:11. At 14:24 it signalled for a reboot:

2020-11-09 14:24:43 INFO Received core agent reboot signal
2020-11-09 14:24:43 INFO [ssm-agent-worker] Stopping ssm agent worker
2020-11-09 14:24:43 INFO [ssm-agent-worker] [instanceID=i-04b3ce4e6e53b0b6f] core manager stop requested. Stop type: HardStop
2020-11-09 14:24:43 INFO [ssm-agent-worker] [HealthCheck] stopping update instance health job.
2020-11-09 14:24:43 INFO [ssm-agent-worker] [MessageGatewayService] Stopping MessageGatewayService.
2020-11-09 14:24:43 INFO [ssm-agent-worker] [MessageGatewayService] Closing controlchannel with channel Id i-04b3ce4e6e53b0b6f
2020-11-09 14:24:43 INFO [ssm-agent-worker] [MessageGatewayService] Closing websocket channel connection to: wss://ssmmessages.eu-west-1.amazonaws.com/v1/control-channel/i-04b3ce4e6e53b0b6f?role=subscribe&stream=input
2020-11-09 14:24:48 INFO [ssm-agent-worker] Bye.

We noticed the box came back up, but nothing was happening, so we SSHed onto the box and noticed the SSM agent service was in a stopped state. Looking at the event logs, it had timed out.

amazon-ssm-agent.log errors.log

noelmcgrath commented 3 years ago

Apologies, I closed this by mistake; reopened again.

Thor-Bjorgvinsson commented 3 years ago

Hey Noelmcgrath, Is this happening consistently or is this the only occurrence of this issue? If this is happening consistently, please turn on debug logging so we can get a little bit more information.

Reading through the agent logs, I notice that we have a gap in when we start logging during Windows startup, so right now we don't see any indication that the agent is actually attempting to start. I'll make sure we close this gap.

We are limited in what we can do to help you here; we recommend you open a support ticket, as pointed out by VishnuKarthikRavindran.

ranjrish commented 3 years ago

Hey Noelmcgrath, Can you recheck if this issue still persists?

Regancm commented 3 years ago

@noelmcgrath I went down this road with AWS SSM engineering through support, but ultimately doing 2 things helped solve this issue.

  1. Increase the Windows service timeout from the 30s default to 60s.
  2. Set the Amazon SSM Agent service to automatic (delayed start). (You could probably get away with just this if you wanted.)
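
In an elevated PowerShell session, both changes look roughly like this (a sketch, not an official fix; 60000 is the ServicesPipeTimeout value in milliseconds, i.e. 60s, and it only takes effect after a reboot):

# 1. Raise the machine-wide service start timeout from the 30000 ms default to 60000 ms.
New-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control' -Name 'ServicesPipeTimeout' -Value 60000 -PropertyType DWord -Force | Out-Null

# 2. Switch the agent service to automatic (delayed start).
sc.exe config AmazonSSMAgent start= delayed-auto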

Now why does this happen? I am not 100% sure, but I know it's related to the fun CPU stuff Windows does on startup plus Go's Windows service package startup behavior. Why do I suspect Go? Well, Datadog's Go-based agent was failing the exact same way: same timeout message and same not-starting behavior on EC2 instances.

I would make sure your agents are up to the latest version, since they appear to have addressed some of the agent-halting issues, possibly by updating the Go Windows service package to the latest?
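
If you want to check which agent version a box is actually running, reading the binary's version info works (this assumes the default install path on Windows):

# Default install location for SSM Agent on Windows; adjust if yours differs.
(Get-Item (Join-Path $env:ProgramFiles 'Amazon\SSM\amazon-ssm-agent.exe')).VersionInfo.ProductVersion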

Regancm commented 3 years ago

This is still consistently happening even after increasing the default timeout to 60 seconds.

eballetbaz commented 2 years ago

Hi, same issue here: case 9315686391 is open.

Thanks

KevinMarquette commented 2 years ago

In my experience, the only way to reliably mitigate this issue has been to use larger instances. The delayed start and changing the timeout didn't ever fix the issue for me.

Regancm commented 2 years ago

@KevinMarquette

I believe the real issue is with the Go Windows svc code. The Datadog agent is almost exactly the same as the SSM agent and has this exact same issue after rebooting Windows.

KevinMarquette commented 2 years ago

What is the plan to improve the customer experience here until that gets resolved? Not being able to depend on the SSM agent impacts the quality of your service offerings that depend on it.

Have you looked into the data to measure the impact of this issue yet? How many people have this issue and don't know it? What's the failure rate of the agent on Windows hosts after reboot (based on instance size)? I imagine you could track running instances compared to online agents as a starting point.
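
As a rough customer-side version of that starting point, something like this with the AWS Tools for PowerShell modules (assuming they are installed and credentials/region are configured):

Import-Module AWS.Tools.EC2, AWS.Tools.SimpleSystemsManagement

# All running instance IDs in this account/region.
$running = (Get-EC2Instance -Filter @{ Name = 'instance-state-name'; Values = 'running' }).Instances.InstanceId

# Instance IDs whose SSM agent currently reports Online.
$online = (Get-SSMInstanceInformation | Where-Object PingStatus -eq 'Online').InstanceId

# Running instances without an Online agent.
$missing = $running | Where-Object { $_ -notin $online }
"Running: {0}  Online agents: {1}  Running without an Online agent: {2}" -f $running.Count, $online.Count, $missing.Count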

tmederpq commented 2 years ago

This may help:

I just spent time troubleshooting and working through this with Amazon support. I had an Automation runbook that I was planning to use to keep some Windows Server 2016 based AMIs up to date. Based largely on this: https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-walk-patch-windows-ami-cli.html

My runbook was intermittently timing out on different update steps: sometimes updating a driver, sometimes installing Windows updates, even when running with Amazon's public AMI. It was always happening because the SSM agent was timing out and never coming back after a restart. I was even using t3.large instances and had followed the advice I'd found here to add a step to my runbook to:

Increase the Windows service timeout from the 30s default to 60s. Set the Amazon SSM Agent service to automatic (delayed start). (You could probably get away with just this if you wanted.)

In the end, the recommendation from Support was to increase that Windows service timeout all the way to 60000 (the ServicesPipeTimeout value is in milliseconds). I've been testing with that, and so far so good.

One of the really nice benefits of Windows, I guess, is the agonizingly long boot time after an update. It seems we need to account for that.

This is how I implemented it in my AMI patch Runbook:

{
      "name": "AddDelayToSSMAgentStartUp",
      "action": "aws:runCommand",
      "maxAttempts": 3,
      "onFailure": "Abort",
      "inputs": {
        "DocumentName": "AWS-RunPowerShellScript",
        "InstanceIds": [
          "{{ LaunchInstance.InstanceIds }}"
        ],
        "Parameters": {
          "commands": [
            "sc.exe config AmazonSSMAgent start= delayed-auto",
            "Set-ItemProperty -Path HKLM:/SYSTEM/CurrentControlSet/Control -Name ServicesPipeTimeout -Value 60000 -Type DWord"
            ]
        }
      }
}

Regancm commented 2 years ago

I imagine you could track running instances compared to online agents as a starting point.

@KevinMarquette increasing timeouts was the only workaround, and even 60 seconds is not enough on rare occasions. We ended up having to do what you said above: I have a job running every 15 minutes that checks for machines that are powered on but whose agent has a "not connected" status, then kicks the service.
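
The "kick" itself is nothing fancy; the part that actually restarts the agent boils down to something like this (illustrative, not our exact job):

# Restart the agent service if it exists but is not running.
$svc = Get-Service -Name 'AmazonSSMAgent' -ErrorAction SilentlyContinue
if ($svc -and $svc.Status -ne 'Running') {
    Start-Service -Name 'AmazonSSMAgent'
}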

This is absolutely not an acceptable workaround, and I personally do not recommend SSM to anyone for mission-critical tasks on Windows EC2 instances because of these specific issues. The rate of agents not starting is something like 2-10% out of the box, and still around 1-2% with all the timeouts increased.

Regancm commented 2 years ago

Also here is why AWS will likely not be able to solve this issue => https://github.com/golang/go/issues/23479

Praba-N commented 1 year ago

Any solution for this issue? I am having the same issue with Windows instances now.

The AmazonSSMAgent service failed to start due to the following error:

Regancm commented 1 year ago

Someone appears to have a solution https://github.com/shirou/gopsutil/issues/570

@Thor-Bjorgvinsson @Praba-N gotta poke someone at AWS to look into it.

Also looks like it was potentially fixed in https://github.com/aws/amazon-ssm-agent/commit/12d1ec4ae31951314ff03c8b4c12866e7321ba30

but I don't know crap about golang..

mlabuda2 commented 11 months ago

I also ran into this issue on a Windows Server 2019 instance, type t2.micro.

Rana-Salama commented 3 months ago

Still seeing this behavior, which was shared with Support before. Any plans to address this?

tommydot commented 2 months ago

Having the same issue in multiple accounts; it seems to happen on instance launch or after a reboot, mainly on Windows Server 2019 instances of varying sizes.

It seems to have only just started being an issue on some of our instances. We do update our CloudWatch agents every month so I wonder if there was a recent change that pushes it over that timeout on our smaller instances.

Aperocky commented 2 weeks ago

The SSM Agent team will be looking into this issue to see if we can implement a workaround for the Go Windows svc wrapper behavior.