aws / ec2-macos-init

EC2 macOS Init is the launch daemon used to initialize Mac instances within EC2.
https://aws.amazon.com/ec2/instance-types/mac/
Apache License 2.0
148 stars 19 forks source link

Undocumented retries and inability to remove that functionality #15

Open mschultz-aofl opened 2 years ago

mschultz-aofl commented 2 years ago

By default, the ec2-macos-init application will retry execution of failed sections. This doesn't appear documented anywhere, and can cause unexpected side effects. As an example, I have a script that configures MacOS certificates and registers the instance as a runner. As the line of my shell script failed came after registration, the machine registered itself with gitlab 101 times before finally stopping. This behavior needs to be modifiable and well documented, as it can cause serious side effects if not known.

Example TOML added:

  Name = "Execute_Startup"
  PriorityGroup = 3 # Fourth group
  RunOnce = true # Run once, ever
  FatalOnError = false # Stop running Init if there is an error 
  [Module.Command]
    Cmd = ["/usr/local/aws/ec2-macos-init/startup.sh"] # A simple command
    RunAsUser = "ec2-user" # Run as ec2-user

And the shell script:

#!/bin/bash

curl -L https://www.apple.com/certificateauthority/AppleWWDRCAG3.cer > g3.cer

sudo security authorizationdb write com.apple.trust-settings.admin allow
sudo security add-trusted-cert -d -r trustAsRoot -k /Library/Keychains/System.keychain g3.cer

security set-key-partition-list -S apple-tool:,apple: -s -k "${password}" login.keychain-db

/usr/local/opt/gitlab-runner/bin/gitlab-runner register --non-interactive --url 'REDACTED'--executor 'shell' --run-untagged='false' --tag-list 'mac' --registration-token REDACTED

git lfs install

Note that git lfs install failed due to a missing $HOME env var, causing it to import the certificate many times and register itself with gitlab 100 times.

My recommended fix would be to accept a 'RetryCount' option in the TOML to make this configurable, and explicitly set the default within the TOML, removing the const variable.

mieliespoor commented 2 years ago

@mschultz-aofl a question unrelated to the issue here, but related to your script. The ${password} you use there for the login keychain, would this be the password set by macos-init or where do you get this from? Asking as I have some credentials and certificates that need to be added to the login keychain, but need the password for this to work, so am a bit stuck on that point.

mschultz-aofl commented 2 years ago

I use packer to generate the shell script and toss it in that path, with the password in there. It's a restricted path, and the password is set per machine so it's not really sensitive. Not that the Packer I use resets the password to a known password - it doesn't use the reset script that's a part of this tool.

So the setup is, within packer:

  1. Set password
  2. Reboot
  3. Create /usr/local/aws/ec2-macos-init/startup.sh with the password hard-coded on it
  4. Modify /usr/local/aws/ec2-macos-init/init.toml to remove password reset and enable the script above
  5. Create AMI
  6. Launch ami via Terraform
mattcataws commented 2 years ago

Hey @mschultz-aofl, thanks for opening this issue to track this. My first guess at what's happening is the logic behind ec2-macos-init's RetryOnce option. When this is present in a module, ec2-macos-init will keep attempting to run the module (once per run) until it marks a successful run in the history. See this block in Module.ShouldRun(): https://github.com/aws/ec2-macos-init/blob/79b17eaba10b88c541ce0f023a086f59a3d1930e/lib/ec2macosinit/module.go#L154-L166

I like your suggestion of adding a new config value to set the maximum number of attempts a module is run unsuccessfully. We'll need to do a bit more investigation into that addition as well as this issue as a whole.

mschultz-aofl commented 2 years ago

So I don't think it's there. If you look here:

https://github.com/aws/ec2-macos-init/blob/e83f1a6ad84142b9c5d07e8ed5ad79af1e9c5923/lib/ec2macosinit/config.go#L93-L105

You can see it retries 100 times. I think just modifying this would be the easiest thing to do. However, I'm not too familiar with Go - but as to why it does this, I think either (a) history isn't re-read/modified in memory, so the running process doesn't pick up the failure, or (b), the 100 retry logic bypasses ShouldRun.

mattcataws commented 2 years ago

I don't that's the case for the module that you've provided as it sets FatalOnError = false. When the module fails, ec2-maocs-init will continue to run and that InitConfig.FatalCount type will not be incremented.

You can double check this by seeing what's written in the log file /var/log/amazon/ec2/ec2-macos-init.log. If there's a fatal call due to the Execute_Startup module you have defined, there should be a log line like the following:

Exiting after <runtime> due to failure in module [Execute_Startup] with FatalOnError set