Open · noelmcgrath opened this issue 3 years ago
Thanks for reaching out. Can you please open a support ticket in AWS Console and attach the full error.log and amazon-ssm-agent.log to assist us in finding the root cause?
https://docs.aws.amazon.com/systems-manager/latest/userguide/sysman-agent-logs.html https://docs.aws.amazon.com/awssupport/latest/user/case-management.html#creating-a-support-case
I can't open a ticket. Attached are the files. Basically, we started an SSM maintenance window for patching on 09/11 at 14:11. At 14:24 it signalled for a reboot:
2020-11-09 14:24:43 INFO Received core agent reboot signal
2020-11-09 14:24:43 INFO [ssm-agent-worker] Stopping ssm agent worker
2020-11-09 14:24:43 INFO [ssm-agent-worker] [instanceID=i-04b3ce4e6e53b0b6f] core manager stop requested. Stop type: HardStop
2020-11-09 14:24:43 INFO [ssm-agent-worker] [HealthCheck] stopping update instance health job.
2020-11-09 14:24:43 INFO [ssm-agent-worker] [MessageGatewayService] Stopping MessageGatewayService.
2020-11-09 14:24:43 INFO [ssm-agent-worker] [MessageGatewayService] Closing controlchannel with channel Id i-04b3ce4e6e53b0b6f
2020-11-09 14:24:43 INFO [ssm-agent-worker] [MessageGatewayService] Closing websocket channel connection to: wss://ssmmessages.eu-west-1.amazonaws.com/v1/control-channel/i-04b3ce4e6e53b0b6f?role=subscribe&stream=input
2020-11-09 14:24:48 INFO [ssm-agent-worker] Bye.
We noticed the box came back up, but nothing was happening, so we SSHed onto the box and saw that the SSM agent service was in a stopped state. Looking at the event logs, it had timed out.
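For anyone wanting to reproduce the check, a minimal PowerShell sketch of confirming the stopped service and finding the timeout in the event log (service name AmazonSSMAgent, the default on Windows):
# Check whether the agent service came back after the reboot
Get-Service -Name AmazonSSMAgent | Select-Object Status, StartType
# Look for start failures/timeouts logged by the Service Control Manager
Get-WinEvent -ProviderName 'Service Control Manager' -MaxEvents 100 |
    Where-Object { $_.Message -match 'AmazonSSMAgent' } |
    Select-Object TimeCreated, Id, Message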
Apologies, I closed this by mistake; reopened again.
Hey @noelmcgrath, is this happening consistently, or is this the only occurrence of the issue? If it is happening consistently, please turn on debug logging so we can get a bit more information.
Reading through the agent logs, I notice we have a gap in when we start logging during Windows startup, so right now we don't see any indication that the agent is actually attempting to start. I'll make sure we close this gap.
We are limited in what we can do to help you here; we recommend you open a support ticket as pointed out by VishnuKarthikRavindran.
Hey @noelmcgrath, can you recheck whether this issue still persists?
@noelmcgrath I went down this road with AWS SSM engineering through support, but ultimately two things helped solve this issue: increasing the Windows service start timeout and setting the Amazon SSM Agent service to automatic (delayed start).
Now, why does this happen? I am not 100% sure, but I know it's related to the fun CPU stuff Windows does on startup plus golang's Windows service package startup behavior. Why do I suspect golang? Well, Datadog's Go-based agent was failing in exactly the same way: same timeout message and same not-starting behavior on EC2 instances.
I would make sure your agents are up to date, since the latest releases appear to have addressed some of the agent halting issues, possibly by updating the golang win svc package to latest?
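If it helps, a minimal sketch of pushing that update out with the stock AWS-UpdateSSMAgent document via the AWS CLI (the instance ID and region are just the ones from the logs above; note this only reaches agents that are currently online, which is the catch-22 with this issue):
# Ask the agent to self-update to the latest released version
aws ssm send-command `
    --document-name "AWS-UpdateSSMAgent" `
    --targets "Key=instanceids,Values=i-04b3ce4e6e53b0b6f" `
    --region eu-west-1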
This is still consistently happening even after increasing the default timeout to 60 seconds.
Hi, same issue here: case 9315686391 is open. Thanks.
In my experience, the only way to reliably mitigate this issue has been to use larger instances. The delayed start and changing the timeout didn't ever fix the issue for me.
@KevinMarquette
I believe the real issue is with the golang windows svc code. Datadog agent is almost exactly the same as SSM agent and has this exact same issue after rebooting windows.
What is the plan to improve the customer experience here until that gets resolved? Not being able to depend on the SSM agent impacts the quality of your service offerings that depend on it.
Have you looked into the data to measure the impact of this issue yet? How many people have this issue and don't know it? What's the failure rate of the agent on Windows hosts after reboot (based on instance size)? I imagine you could track running instances compared to online agents as a starting point.
This may help:
I just spent time troubleshooting and working through this with Amazon support. I had an Automation runbook that I was planning to use to keep some Windows Server 2016 based AMIs up to date. Based largely on this: https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-walk-patch-windows-ami-cli.html
My runbook was intermittently timing out on different update steps, sometimes while updating a driver, sometimes while installing Windows updates, even when running with Amazon's public AMI. It always happened because the SSM agent timed out and never came back after a restart. I was even using t3.large instances and had followed the advice I'd found here to add a step to my runbook to:
Increase the Windows service timeout from the 30s default to 60s.
Set the Amazon SSM Agent service to automatic (delayed start). (You could probably get away with just this if you wanted.)
In the end, the recommendation from Support was to increase that Windows service timeout via the ServicesPipeTimeout registry value, which is specified in milliseconds (the runbook step below sets it to 60000, i.e. 60 seconds). I've been testing with that, and so far so good.
One of the really nice benefits of Windows, I guess, is the agonizingly long boot times after an update. It seems we need to account for that.
This is how I implemented it in my AMI patch Runbook:
{
  "name": "AddDelayToSSMAgentStartUp",
  "action": "aws:runCommand",
  "maxAttempts": 3,
  "onFailure": "Abort",
  "inputs": {
    "DocumentName": "AWS-RunPowerShellScript",
    "InstanceIds": [
      "{{ LaunchInstance.InstanceIds }}"
    ],
    "Parameters": {
      "commands": [
        "sc.exe config AmazonSSMAgent start= delayed-auto",
        "Set-ItemProperty -Path HKLM:/SYSTEM/CurrentControlSet/Control -Name ServicesPipeTimeout -Value 60000 -Type DWord"
      ]
    }
  }
}
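For what it's worth, a quick sketch of verifying that both changes took effect on an instance; as far as I know, ServicesPipeTimeout is machine-wide and only takes effect after the next reboot:
# Start type should now report AUTO_START (DELAYED)
sc.exe qc AmazonSSMAgent
# ServicesPipeTimeout is expressed in milliseconds (60000 = 60 seconds)
Get-ItemProperty -Path HKLM:\SYSTEM\CurrentControlSet\Control -Name ServicesPipeTimeout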
@KevinMarquette Increasing timeouts was the only workaround, and even 60 seconds is not enough on rare occasions. We ended up having to do what you said above: I have a job running every 15 minutes that checks for machines that are powered on while the agent is in a not-connected status, and then kicks the service (roughly sketched below).
This is absolutely not an acceptable workaround, and I personally do not recommend SSM to anyone for mission-critical tasks on Windows EC2 instances because of these specific issues. The rate of agents not starting is something like 2-10% out of the box, and still around 1-2% with all the timeouts increased.
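In case it's useful to anyone else, a rough sketch of what that watchdog check can look like, using the AWS CLI from PowerShell (the actual "kick the service" step is environment-specific, so it's left as a message here):
# Instances EC2 reports as running
$running = (aws ec2 describe-instances `
    --filters "Name=instance-state-name,Values=running" `
    --query "Reservations[].Instances[].InstanceId" --output text) -split '\s+'
# Instances whose SSM agent is currently reporting Online
$online = (aws ssm describe-instance-information `
    --query "InstanceInformationList[?PingStatus=='Online'].InstanceId" --output text) -split '\s+'
# Anything running but not Online is a candidate for a stuck agent
$stuck = $running | Where-Object { $_ -and ($online -notcontains $_) }
$stuck | ForEach-Object { Write-Output "SSM agent not connected on $_" }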
Also here is why AWS will likely not be able to solve this issue => https://github.com/golang/go/issues/23479
Any solution for this issue? I'm having the same issue on a Windows instance now.
The AmazonSSMAgent service failed to start due to the following error:
Someone appears to have a solution https://github.com/shirou/gopsutil/issues/570
@Thor-Bjorgvinsson @Praba-N gotta poke someone at AWS to look into it.
Also looks like it was potentially fixed in https://github.com/aws/amazon-ssm-agent/commit/12d1ec4ae31951314ff03c8b4c12866e7321ba30
but I don't know crap about golang..
I also ran into this issue on a Windows Server 2019 instance (instance type: t2.micro).
Still seeing this behavior, which was shared with support before. Any plans to address this?
Having the same issue in multiple accounts; it seems to happen on instance launch or after a reboot, mainly on Windows Server 2019 instances of varying sizes.
It seems to have only just started being an issue on some of our instances. We do update our CloudWatch agents every month, so I wonder if a recent change pushes startup over that timeout on our smaller instances.
The SSM Agent team will be looking into this issue to see if we can implement a workaround for the golang Windows svc wrapper behavior.
We are applying patches to our Windows instances using the Patch Manager function in AWS Systems Manager. We have a patch baseline that is executed against a set of Windows instances (each of which is part of a patch group) via a maintenance window, which in turn executes a Run Command (AWS-RunPatchBaseline) against each of the instances. However, we are finding the following:
The instances in question seem to get patches installed correctly. Executing
wmic qfe list
shows that the patches have been installed on the target machines.
The target instances are then rebooted after the patches are installed.
The run command remains in progress indefinitely.
From further investigation, we found that the amazon-ssm-agent service failed to start when the instances were rebooted. Looking at the event logs shows a timeout occurred:
Get-WinEvent -ProviderName 'Service Control Manager'
Output: (Get-WinEvent output showing the service start timeout; not reproduced here)
Once we manually restarted the amazon-ssm-agent service, the run command completed successfully. The issue is that we do not want to have to manually start the amazon-ssm-agent on each instance, especially as we have a lot of instances. This suggests it is not an issue with Persistent Routes either, and I have just double-checked:
Instance IP 10.1.3.217
Persistent Routes:
Any ideas on what is causing this, i.e. why the amazon-ssm-agent is not starting up successfully after the automatic reboot?
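For reference, the manual recovery mentioned above is just restarting the stopped service on each affected instance and confirming it stays up; a minimal sketch:
# Kick the stopped agent and confirm it reports Running
Start-Service -Name AmazonSSMAgent
Get-Service -Name AmazonSSMAgent | Select-Object Status, StartType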