Open andrewhamon opened 3 weeks ago
Hey @andrewhamon, thanks for opening this issue to discuss this behavior you're seeing from ec2-macos-init.
The ManageEC2User module is set up to only run on the first boot of an image - init.toml#L129. Are you running ec2-macos-init clean or removing anything along this path - /usr/local/aws/ec2-macos-init/instances/<instance-id>/history.json? If ec2-macos-init doesn't find a history file, it assumes it's running for the very first time and executes all modules accordingly, which could explain why you see the module rerun in a derivative AMI.
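To illustrate the first-run check described above, here is a minimal sketch in Go. It is not the actual ec2-macos-init code: the function names historyPath and isFirstRun are hypothetical, and the only detail taken from this thread is the location of history.json under the per-instance directory.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// historyPath builds the per-instance history file location mentioned above.
// (Hypothetical helper, not part of ec2-macos-init.)
func historyPath(baseDir, instanceID string) string {
	return filepath.Join(baseDir, "instances", instanceID, "history.json")
}

// isFirstRun reports true when no history file exists for the instance,
// which is the condition under which every module would run again.
func isFirstRun(baseDir, instanceID string) (bool, error) {
	_, err := os.Stat(historyPath(baseDir, instanceID))
	if err == nil {
		return false, nil // history exists, so RunOnce modules would be skipped
	}
	if errors.Is(err, fs.ErrNotExist) {
		return true, nil // no history, treated as a first boot
	}
	return false, err // some other problem, e.g. permissions
}

func main() {
	// The instance ID below is a made-up placeholder.
	first, err := isFirstRun("/usr/local/aws/ec2-macos-init", "i-0123456789abcdef0")
	if err != nil {
		fmt.Println("could not check history:", err)
		return
	}
	fmt.Println("first run:", first)
}
```

Under this model, deleting the history file (or the whole instances directory) before creating an AMI would make every instance launched from that AMI look like a first boot.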
It would be great if this failure was a soft failure, and other modules had a chance to complete.
The default settings for the ManageEC2User module are to terminate immediately if there's any problem. However, since you're seeing this on a derivative AMI that you've set the ec2-user password on, I'd recommend updating those settings to no longer be fatal.
You can do this by editing the FatalOnError setting in the config file located at /usr/local/aws/ec2-macos-init/init.toml. Here's what it looks like in the default config - init.toml#L130.
You might want a change that looks something like this:
# Set a random password for ec2-user
[[Module]]
Name = "ManageEC2User"
PriorityGroup = 3 # Third group
RunOnce = true # Run only on the first boot
- FatalOnError = true # Must succeed
+ FatalOnError = false # Don't require success, allow ec2-macos-init to continue running if it fails
[Module.UserManagement]
User = "ec2-user" # This user must exist locally in /Users/
RandomizePassword = true # default is true
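As a rough illustration of what flipping FatalOnError changes, the sketch below models a module runner in Go. It is an assumption-laden simplification, not the real ec2-macos-init implementation: the module type, runModules, and the InstallSSHKey module name are hypothetical, and modules run strictly in order here, whereas the real tool uses priority groups.

```go
package main

import (
	"errors"
	"fmt"
)

// module is a hypothetical stand-in for an entry in init.toml.
type module struct {
	Name         string
	FatalOnError bool
	Run          func() error
}

// errPasswordChange simulates the password change failing on a derivative AMI.
var errPasswordChange = errors.New("unable to change password")

// runModules runs modules in order. A failure in a module with
// FatalOnError = true stops everything; otherwise the failure is
// logged and later modules still get a chance to run.
func runModules(mods []module) (completed []string, fatal error) {
	for _, m := range mods {
		if err := m.Run(); err != nil {
			if m.FatalOnError {
				return completed, fmt.Errorf("module %s failed fatally: %w", m.Name, err)
			}
			fmt.Printf("module %s failed (non-fatal): %v\n", m.Name, err)
			continue
		}
		completed = append(completed, m.Name)
	}
	return completed, nil
}

func main() {
	mods := []module{
		{Name: "ManageEC2User", FatalOnError: false, Run: func() error { return errPasswordChange }},
		{Name: "InstallSSHKey", FatalOnError: true, Run: func() error { return nil }},
	}
	completed, fatal := runModules(mods)
	fmt.Println("completed:", completed, "fatal:", fatal)
}
```

With FatalOnError set to false on the failing module, the later module still completes; setting it back to true makes the runner stop before reaching it.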
I think the "bug" here is that a failure to set a password is considered a retry-able error and keeps retrying
I agree, this is odd behavior and it probably shouldn't continue retrying after failure. As a bit of context, ec2-macos-init creates a temporary file to track the number of fatal errors that it has encountered, which can be seen here - fatalcount.go#L15.
It's possible that this file is persisting the count for past errors that are no longer applicable to your derivative AMI, which is why you see that message at the end of the log file once the ManageEC2User module fails. However, this tracking doesn't include the source of the fatal error, so we can't tell what's causing this without checking the logs.
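The sketch below shows one way a file-backed fatal counter like the one described above could behave, and why a stale count might trip the limit early. It is a guess at the general shape, not a copy of fatalcount.go: the file format, the function names, and the demo path are hypothetical, and the limit of 100 is taken from the retry limit mentioned elsewhere in this thread.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// fatalLimit is an assumption based on the 100 retry limit reported in this issue.
const fatalLimit = 100

// readFatalCount returns the stored count, or 0 if the file does not exist yet.
func readFatalCount(path string) (int, error) {
	data, err := os.ReadFile(path)
	if errors.Is(err, fs.ErrNotExist) {
		return 0, nil
	}
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(strings.TrimSpace(string(data)))
}

// recordFatal increments the stored count and returns the new value.
// Note that only the number is stored, not which module failed.
func recordFatal(path string) (int, error) {
	n, err := readFatalCount(path)
	if err != nil {
		return 0, err
	}
	n++
	if err := os.WriteFile(path, []byte(strconv.Itoa(n)), 0o644); err != nil {
		return 0, err
	}
	return n, nil
}

// demoCountPath returns a counter file path inside a fresh temporary directory.
func demoCountPath() (string, error) {
	dir, err := os.MkdirTemp("", "fatalcount-demo")
	if err != nil {
		return "", err
	}
	return filepath.Join(dir, "fatal-count"), nil
}

func main() {
	path, err := demoCountPath()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	// Simulate a stale count left over from earlier, unrelated failures.
	if err := os.WriteFile(path, []byte("99"), 0o644); err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	n, err := recordFatal(path)
	if err != nil {
		fmt.Println("could not record fatal:", err)
		return
	}
	fmt.Printf("fatal count is now %d, limit reached: %v\n", n, n >= fatalLimit)
}
```

Because a counter like this keeps no record of the failing module, a leftover value and a fresh failure are indistinguishable without the logs.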
Please let me know if this helps and if you have any other comments or questions.
The ManageEC2User module is set up to only run on the first boot of an image - init.toml#L129. Are you running ec2-macos-init clean or removing anything along this path
Ahh. Yes, I somewhat blindly followed the advice of this AWS blog post and am running ec2-macos-init clean -all in my AMI build. But I am also setting a password in the same AMI build.
For reference - I am building a base AMI for use at my company, but we want all those instances to have a password that we store in Secrets Manager.
Should I not be running ec2-macos-init clean for this AMI build?
My workaround is to change RandomizePassword to false in init.toml.
Glad to hear you found a workaround. That seems like a better way to ensure the module doesn't try to change the password, since you've already set it.
Running ec2-macos-init clean is our recommended method if you want to provide a clean slate for ec2-macos-init in a derivative AMI. The --all flag tells the clean command to also remove all instance history (code link), which will cause ec2-macos-init to think it's running for the first time whenever you launch an instance with that AMI.
If this is your desired outcome, I would recommend updating the init.toml file to tune any of the other modules with this first-run behavior in mind for your AMI. This might not be necessary if the ManageEC2User module was the only one causing problems for you.
Hopefully this helps answer your questions, but please do let me know if there's anything else.
If I make an AMI where I create a password, subsequent runs of ec2-macos-init will fail before they ever get a chance to install the default ssh key.
Here is an example run:
It would be great if this failure was a soft failure, and other modules had a chance to complete.
I think the "bug" here is that a failure to set a password is considered a retry-able error and keeps retrying until the 100 retry limit. Then the program exits. I think it's mainly bad luck/race conditions that this usually happens before the default ssh key can be installed.