aws / amazon-ssm-agent

An agent to enable remote management of your EC2 instances, on-premises servers, or virtual machines (VMs).
https://aws.amazon.com/systems-manager/
Apache License 2.0

Does SSM Agent exit if cfn-init process is running? #444

Closed lorengordon closed 2 years ago

lorengordon commented 2 years ago

I am seeing some weird behavior and am hoping you can confirm what the SSM Agent is doing. I have a CFN template that uses an SSM Association to run the document AWS-JoinDirectoryServiceDomain. I am also using cfn-init Metadata to apply the rest of the instance config at launch time. This is a Windows instance, and to coordinate between the SSM Association and cfn-init, I have a simple step in the cfn-init config that waits for the instance to reboot:

        join-domain:
          commands:
            10-join-domain:
              command: powershell.exe -NoLogo -NoProfile -NonInteractive -ExecutionPolicy Bypass  -Command Write-Verbose 'Waiting for SSM to complete domain join, which reboots the instance automatically' -Verbose
              waitAfterCompletion: forever

With waitAfterCompletion set to forever, cfn-init exits after the command and resumes only after the SSM Association has joined the domain and rebooted the computer. From the cfn-init documentation for this key:

For Windows systems only. Specifies how long to wait (in seconds) after a command has finished in case the command causes a reboot. The default value is 60 seconds and a value of "forever" directs cfn-init to exit and resume only after the reboot is complete. Set this value to 0 if you don't want to wait for every command.

This has worked all right, except that the domain join sometimes fails, and that failure doesn't get communicated back to CloudFormation, which makes it difficult to coordinate any error handling. So I was testing other values, like '1200', so that at least I could get it to fail faster:

              waitAfterCompletion: '1200'

This is where I am seeing the weird behavior. The domain join never happens. The Run Command invocation fails with the mysterious error "Delivery Timed Out". If I log in to the system while testing this, I can see in the cfn-init log that it is waiting the requested 1200 seconds for the reboot:

2022-05-16 21:34:32,273 [DEBUG] Running command 10-join-domain
2022-05-16 21:34:32,273 [DEBUG] No test for command 10-join-domain
2022-05-16 21:34:33,719 [INFO] Command 10-join-domain succeeded
2022-05-16 21:34:33,719 [DEBUG] Command 10-join-domain output: VERBOSE: Waiting for SSM to complete domain join, which reboots the instance automatically

2022-05-16 21:34:33,719 [INFO] Waiting 1200 seconds for reboot

However, there are no logs from the SSM Agent at all. If I set the value back to forever, then everything works fine.

So my working theory is that the SSM Agent refuses to run when it detects that the cfn-init process is running. Is that true? I can't seem to find any documentation on this behavior, or any interaction with cfn-init.

lorengordon commented 2 years ago

Well, that's interesting. I checked the Windows Event Logs. I see an event where the startup type of the Amazon SSM Agent service is changed from Disabled to Automatic (not sure what is making that change; I guess it's just part of the Amazon Windows AMI?). But the service isn't started. If I start the service manually, the SSM logs show up and the instance executes the Run Command to join the domain.

gianniLesl commented 2 years ago

The agent startup does not depend on cfn-init, so if logs are not appearing, it means the service hasn't started. If the service doesn't start, Systems Manager cannot deliver documents to the agent for execution, and the execution will eventually return a status of "Delivery Timed Out".
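
Given the observation above that starting the service manually lets the Run Command go through, one possible workaround is to have cfn-init start the agent itself before the wait step. The fragment below is a minimal, untested sketch of that idea. The step name 05-start-ssm-agent is hypothetical, and AmazonSSMAgent is assumed to be the Windows service name of the agent on the AMI in use. It relies on cfn-init processing commands in alphabetical order by name, so the 05- step runs before 10-join-domain.

```yaml
        join-domain:
          commands:
            # Hypothetical extra step: start the SSM Agent service explicitly
            # so Systems Manager can deliver the domain-join document.
            # Assumes the service is named AmazonSSMAgent.
            05-start-ssm-agent:
              command: powershell.exe -NoLogo -NoProfile -NonInteractive -ExecutionPolicy Bypass -Command Start-Service -Name AmazonSSMAgent
              # No reboot is expected from this step, so skip the default 60-second wait.
              waitAfterCompletion: '0'
            # Original wait step from this issue, with the finite timeout.
            10-join-domain:
              command: powershell.exe -NoLogo -NoProfile -NonInteractive -ExecutionPolicy Bypass -Command Write-Verbose 'Waiting for SSM to complete domain join, which reboots the instance automatically' -Verbose
              waitAfterCompletion: '1200'
```

Whether this resolves the timeout depends on what normally starts the service at launch, which this thread does not establish, so treat it as something to test rather than a confirmed fix.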