Password randomization failure blocks ssh key install

andrewhamon commented 3 weeks ago

If I make an AMI where I create a password, subsequent runs of ec2-macos-init will fail before they ever get a chance to install the default ssh key.

Here is an example run:

2024/08/22 23:47:30.984552 Fetching instance ID from IMDS...
2024/08/22 23:47:30.987089 Running on instance i-0bf7783cd5199dc8d
2024/08/22 23:47:30.987130 Reading init config...
2024/08/22 23:47:30.989063 Successfully read init config
2024/08/22 23:47:30.989097 Validating config...
2024/08/22 23:47:30.989257 Successfully validated config
2024/08/22 23:47:30.989268 Prioritizing modules...
2024/08/22 23:47:30.989290 Successfully prioritized modules
2024/08/22 23:47:30.989299 Creating instance history directories for current instance...
2024/08/22 23:47:30.989585 Successfully created directories
2024/08/22 23:47:30.989598 Getting instance history...
2024/08/22 23:47:30.989782 Successfully gathered instance history
2024/08/22 23:47:30.989793 Processing priority level 1 (2 modules)...
2024/08/22 23:47:30.989819 Running module [UnmountLocalSSD] (type: command, group: 1)
2024/08/22 23:47:30.989834 Running module [DisableEthernet] (type: command, group: 1)
2024/08/22 23:47:31.037840 Successfully completed module [DisableEthernet] (type: command, group: 1) with message: successfully ran command [[/usr/sbin/networksetup -setnetworkserviceenabled Ethernet off]] with stdout [] and stderr []
2024/08/22 23:47:31.697122 Successfully completed module [UnmountLocalSSD] (type: command, group: 1) with message: successfully ran command [[/bin/zsh -c diskutil list internal physical | egrep -o '^/dev/disk\d+' | xargs diskutil eject || true]] with stdout [] and stderr [Volume failed to eject]
2024/08/22 23:47:31.698805 Successfully completed processing of priority level 1
2024/08/22 23:47:31.698834 Processing priority level 2 (1 modules)...
2024/08/22 23:47:31.698893 Running module [CheckNetworkIsUp] (type: networkcheck, group: 2)
2024/08/22 23:47:31.738534 Successfully completed module [CheckNetworkIsUp] (type: networkcheck, group: 2) with message: successfully pinged default gateway with a RTT of 266.667µs
2024/08/22 23:47:31.738626 Successfully completed processing of priority level 2
2024/08/22 23:47:31.738647 Processing priority level 3 (12 modules)...
2024/08/22 23:47:31.738696 Running module [GrowRootAPFSVolume] (type: command, group: 3)
2024/08/22 23:47:31.738839 Running module [NeverSleep] (type: command, group: 3)
2024/08/22 23:47:31.738848 Running module [ManageEC2User] (type: usermanagement, group: 3)
2024/08/22 23:47:31.738884 Running module [UpdateMOTD] (type: motd, group: 3)
2024/08/22 23:47:31.738950 Running module [SetDefaultTimezone] (type: command, group: 3)
2024/08/22 23:47:31.739207 Running module [EC2SuggestedDefaultConfigPerformance] (type: systemconfig, group: 3)
2024/08/22 23:47:31.739451 Running module [SetAmazonTimeSync] (type: command, group: 3)
2024/08/22 23:47:31.739488 Running module [NeverSleepDisplay] (type: command, group: 3)
2024/08/22 23:47:31.739657 Running module [DisableSleep] (type: command, group: 3)
2024/08/22 23:47:31.739639 Running module [EC2SuggestedDefaultConfigSecurity] (type: systemconfig, group: 3)
2024/08/22 23:47:31.739851 Running module [RemoveSSHGroup] (type: command, group: 3)
2024/08/22 23:47:31.740006 Running module [DisableWiFi] (type: command, group: 3)
2024/08/22 23:47:31.753394 Error while running module [GrowRootAPFSVolume] (type: command, group: 3) with message:  and err: ec2macosinit: error executing command [[/bin/zsh -c ec2-macos-utils grow --id root]] with stdout [] and stderr [zsh:1: command not found: ec2-macos-utils]: exit status 127
2024/08/22 23:47:31.762102 Did not modify sysctl property [kern.aioprocmax=256]
2024/08/22 23:47:31.762206 Did not modify sysctl property [net.inet.tcp.autorcvbufmax=33554432]
2024/08/22 23:47:31.766793 Did not modify sysctl property [kern.aiomax=900]
2024/08/22 23:47:31.766774 Did not modify sysctl property [net.inet.tcp.win_scale_factor=8]
2024/08/22 23:47:31.767872 Did not modify sysctl property [kern.aiothreads=64]
2024/08/22 23:47:31.768859 Did not modify sysctl property [net.inet.tcp.recvspace=1048576]
2024/08/22 23:47:31.769714 Did not modify sysctl property [net.inet.tcp.autosndbufmax=33554432]
2024/08/22 23:47:31.769886 Did not modify sysctl property [net.inet.tcp.sendspace=1048576]
2024/08/22 23:47:31.774912 Did not modify sysctl property [net.link.generic.system.rcvq_maxlen=1024]
2024/08/22 23:47:31.792731 Did not modify SSHD configuration
2024/08/22 23:47:31.848976 Did not modify default [ConfigDataInstall]
2024/08/22 23:47:31.849036 Did not modify default [AutomaticallyInstallMacOSUpdates]
2024/08/22 23:47:31.849161 Did not modify default [AutomaticDownload]
2024/08/22 23:47:31.849292 Did not modify default [AutomaticCheckEnabled]
2024/08/22 23:47:31.884556 Successfully completed module [EC2SuggestedDefaultConfigSecurity] (type: systemconfig, group: 3) with message: system configuration completed with [0 changed / 1 unchanged /0 error(s)] out of 1 requested changes
2024/08/22 23:47:31.890440 Successfully completed module [UpdateMOTD] (type: motd, group: 3) with message: successfully updated motd file [/etc/motd] with version string [macOS Sonoma 14.5]
2024/08/22 23:47:31.898016 Did not modify default [CriticalUpdateInstall]
2024/08/22 23:47:31.898050 Successfully completed module [EC2SuggestedDefaultConfigPerformance] (type: systemconfig, group: 3) with message: system configuration completed with [0 changed / 14 unchanged / 0 error(s)] out of 14 requested changes
2024/08/22 23:47:31.928354 Successfully completed module [SetDefaultTimezone] (type: command, group: 3) with message: successfully ran command [[systemsetup -settimezone GMT]] with stdout [Set TimeZone: GMT] and stderr [2024-08-22 23:47:31.927 systemsetup[10242:88077] ### Error:-99 File:/AppleInternal/Library/BuildRoots/91a344b1-f985-11ee-b563-fe8bc7981bff/Library/Caches/com.apple.xbs/Sources/Admin/InternetServices.m Line:379]
2024/08/22 23:47:31.934127 Successfully completed module [RemoveSSHGroup] (type: command, group: 3) with message: successfully ran command [[/bin/zsh -c dscl /Local/Default delete /Groups/com.apple.access_ssh || true]] with stdout [delete: Invalid Path] and stderr [<dscl_cmd> DS Error: -14009 (eDSUnknownNodeName)]
2024/08/22 23:47:31.961197 Successfully completed module [DisableWiFi] (type: command, group: 3) with message: successfully ran command [[/bin/zsh -c wifidevice="$(networksetup -listallhardwareports |grep -A 1 "Wi-Fi" | tail -n 1 | cut -d " " -f2)"; if [[ ! -z $wifidevice ]]; then networksetup -setairportpower $wifidevice off; fi]] with stdout [] and stderr []
2024/08/22 23:47:31.978330 Successfully completed module [NeverSleepDisplay] (type: command, group: 3) with message: successfully ran command [[sudo pmset -a displaysleep 0]] with stdout [] and stderr[]
2024/08/22 23:47:31.981452 Successfully completed module [DisableSleep] (type: command, group: 3) with message: successfully ran command [[sudo pmset -a disablesleep 1]] with stdout [] and stderr []
2024/08/22 23:47:31.997461 Successfully completed module [NeverSleep] (type: command, group: 3) with message: successfully ran command [[sudo pmset -a sleep 0]] with stdout [] and stderr []
2024/08/22 23:47:32.034087 Successfully completed module [SetAmazonTimeSync] (type: command, group: 3) with message: successfully ran command [[systemsetup -setusingnetworktime on -setnetworktimeserver 169.254.169.123]] with stdout [Network Time is already on.
setNetworkTimeServer: 169.254.169.123] and stderr [2024-08-22 23:47:32.033 systemsetup[10259:88082] ### Error:-99 File:/AppleInternal/Library/BuildRoots/91a344b1-f985-11ee-b563-fe8bc7981bff/Library/Caches/com.apple.xbs/Sources/Admin/InternetServices.m Line:379]
2024/08/22 23:47:32.111164 Error while running module [ManageEC2User] (type: usermanagement, group: 3) with message:  and err: ec2macosinit: failed to randomize password: ec2macosinit: unable to set secure password: ec2macosinit: failed to set ec2-user's password: exit status 67
2024/08/22 23:47:32.111209 Successfully completed processing of priority level 3
2024/08/22 23:47:32.111216 Writing instance history for instance i-0bf7783cd5199dc8d...
2024/08/22 23:47:32.133068 Successfully wrote instance history
2024/08/22 23:47:32.140094 Number of fatal retries (101) exceeded, exiting 0 to avoid infinite runs
2024/08/22 23:47:32.140113 Exiting after 1.152981375s due to failure in module [ManageEC2User] with FatalOnError set
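
To pick the failures out of a long run like the one above, a small filter over the log text is enough. This is a minimal sketch, assuming the "Error while running module [...]" line format seen in this log; the function name `failed_modules` is made up for illustration and the sample lines are shortened.

```python
import re

# Matches lines such as:
#   ... Error while running module [ManageEC2User] (type: usermanagement, group: 3) with message: ...
ERROR_LINE = re.compile(
    r"Error while running module \[(?P<name>[^\]]+)\] "
    r"\(type: (?P<type>[^,]+), group: (?P<group>\d+)\)"
)


def failed_modules(log_text):
    """Return (name, type, group) for every module that logged an error."""
    return [
        (m.group("name"), m.group("type"), int(m.group("group")))
        for m in ERROR_LINE.finditer(log_text)
    ]


# Shortened copies of three lines from the run above.
sample = """\
2024/08/22 23:47:31.753394 Error while running module [GrowRootAPFSVolume] (type: command, group: 3) with message:  and err: (shortened)
2024/08/22 23:47:31.890440 Successfully completed module [UpdateMOTD] (type: motd, group: 3) with message: (shortened)
2024/08/22 23:47:32.111164 Error while running module [ManageEC2User] (type: usermanagement, group: 3) with message:  and err: (shortened)
"""

for name, mod_type, group in failed_modules(sample):
    print(f"{name} ({mod_type}, group {group}) failed")
```

On this run it reports two failures: GrowRootAPFSVolume, which did not stop the run, and ManageEC2User, which the final log line names as the module with FatalOnError set.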

It would be great if this failure was a soft failure, and other modules had a chance to complete.

I think the "bug" here is that a failure to set a password is considered a retry-able error and keeps retrying until the 100 retry limit. Then the program exits. I think it's mainly bad luck or a race condition that this usually happens before the default ssh key can be installed.

mattcataws commented 2 weeks ago

Hey @andrewhamon, thanks for opening this issue to discuss this behavior you're seeing from ec2-macos-init.

The ManageEC2User module is set up to only run on the first boot of an image - init.toml#L129. Are you running ec2-macos-init clean or removing anything along this path - /usr/local/aws/ec2-macos-init/instances/<instance-id>/history.json? If ec2-macos-init doesn't find a history file, it will assume it's running for the very first time and will execute all modules as such, which could explain why you see the module rerun in a derivative AMI.
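
The first-boot behavior described above can be pictured roughly like this. It is only a sketch of the idea, not the actual ec2-macos-init logic: the function `should_run` and the `"succeeded"` key in the history file are hypothetical, and the real history.json layout may differ.

```python
import json
import os
import tempfile


def should_run(module_name, run_once, history_path):
    """Decide whether a module runs, given an instance history file.

    Hypothetical rule: a RunOnce module is skipped only when the history
    file exists and records an earlier success for that module.
    """
    if not run_once:
        return True
    if not os.path.exists(history_path):
        # No history at all: treated as the very first boot.
        return True
    with open(history_path) as f:
        history = json.load(f)
    return module_name not in history.get("succeeded", [])


# Demo in a temporary directory instead of /usr/local/aws/ec2-macos-init.
workdir = tempfile.mkdtemp()
history_path = os.path.join(workdir, "history.json")

print(should_run("ManageEC2User", True, history_path))  # no history file yet

with open(history_path, "w") as f:
    json.dump({"succeeded": ["ManageEC2User"]}, f)
print(should_run("ManageEC2User", True, history_path))  # history records a success

os.remove(history_path)  # comparable to wiping the instance history
print(should_run("ManageEC2User", True, history_path))
```

Under these assumptions, removing the history file is enough to make a RunOnce module run again on the next launch.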

It would be great if this failure was a soft failure, and other modules had a chance to complete.

The default settings for the ManageEC2User module are to terminate immediately if there's any problem. However, since you're seeing this on a derivative AMI that you've set the ec2-user password on, I'd recommend updating those settings to no longer be fatal.

You can do this by editing the FatalOnError setting in the config file located at /usr/local/aws/ec2-macos-init/init.toml. Here's what it looks like in the default config - init.toml#L130.

You might want a change that looks something like this:

# Set a random password for ec2-user
[[Module]]
    Name = "ManageEC2User"
    PriorityGroup = 3 # Third group
    RunOnce = true # Run only on the first boot
-   FatalOnError = true # Must succeed
+   FatalOnError = false # Don't require success, allow ec2-macos-init to continue running if it fails
    [Module.UserManagement]
        User = "ec2-user" # This user must exist locally in /Users/
        RandomizePassword = true # default is true

I think the "bug" here is that a failure to set a password is considered a retry-able error and keeps retrying

I agree, this is odd behavior and it probably shouldn't continue retrying after failure. As a bit of context, ec2-macos-init creates a temporary file to track the number of fatal errors that it's encountered, which can be seen here - fatalcount.go#L15.

It's possible that this file is persisting the count of past errors that are no longer applicable to your derivative AMI, which is why you see that message at the end of the log file once the ManageEC2User module fails. However, this tracking doesn't include the source of each fatal error, so we can't tell what's causing this without checking the logs.
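
As a rough illustration of how a persisted counter could produce that final log line, here is a minimal sketch. It is not the code in fatalcount.go: the limit of 100 is inferred from the "(101) exceeded" message, and the file format, the function name `record_fatal`, and the non-zero exit code are assumptions.

```python
import os
import tempfile

FATAL_LIMIT = 100  # assumption, inferred from "(101) exceeded" in the log


def record_fatal(counter_path):
    """Increment the persisted fatal count and return the exit code to use.

    Returns 0 once the count exceeds FATAL_LIMIT, mirroring the
    "exiting 0 to avoid infinite runs" message, and 1 otherwise (assumed).
    """
    count = 0
    if os.path.exists(counter_path):
        with open(counter_path) as f:
            count = int(f.read().strip() or 0)
    count += 1
    with open(counter_path, "w") as f:
        f.write(str(count))
    return 0 if count > FATAL_LIMIT else 1


# A counter file left over from earlier runs, already at the limit.
workdir = tempfile.mkdtemp()
stale_counter = os.path.join(workdir, "fatal-count")
with open(stale_counter, "w") as f:
    f.write("100")

# A single new fatal error is enough to cross the limit.
print(record_fatal(stale_counter))
```

If the count carries over like this, the "retries exceeded" message can appear after a single failure on the new instance, and nothing in the counter says which module caused the earlier ones.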

Please let me know if this helps and if you have any other comments or questions.

andrewhamon commented 2 weeks ago

The ManageEC2User module is set up to only run on the first boot of an image - init.toml#L129. Are you running ec2-macos-init clean or removing anything along this path

Ahh. Yes, I somewhat blindly followed the advice of this AWS blog post and am running ec2-macos-init clean --all in my AMI build. But I am also setting a password in the same AMI build.

For reference - I am building a base AMI for use at my company, but we want all those instances to have a password that we store in Secrets Manager.

Should I not be running ec2-macos-init clean for this AMI build?

andrewhamon commented 2 weeks ago

My workaround is to change RandomizePassword to false in init.toml.
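
For anyone scripting this in an AMI build, a line-based edit along these lines should work. It is a sketch only: `set_module_key` is a made-up helper, it assumes Name appears before the key inside each [[Module]] block (as in the snippet quoted earlier in this thread), and it drops any trailing comment on the line it rewrites. The second module in the sample is invented for the demo.

```python
import re


def set_module_key(toml_text, module_name, key, value):
    """Set `key = value` inside the [[Module]] block whose Name matches.

    Works on plain text so the rest of the file, comments included,
    is left untouched.
    """
    out = []
    in_target = False
    for line in toml_text.splitlines():
        stripped = line.strip()
        if stripped == "[[Module]]":
            in_target = False  # a new module starts; wait for its Name
        name_match = re.match(r'Name\s*=\s*"([^"]+)"', stripped)
        if name_match:
            in_target = name_match.group(1) == module_name
        elif in_target and re.match(rf"{re.escape(key)}\s*=", stripped):
            indent = line[: len(line) - len(line.lstrip())]
            line = f"{indent}{key} = {value}"
        out.append(line)
    return "\n".join(out) + "\n"


sample = """\
[[Module]]
    Name = "ManageEC2User"
    PriorityGroup = 3 # Third group
    RunOnce = true # Run only on the first boot
    FatalOnError = true # Must succeed
    [Module.UserManagement]
        User = "ec2-user" # This user must exist locally in /Users/
        RandomizePassword = true # default is true

[[Module]]
    Name = "ExampleOther" # invented module, only here for the demo
    FatalOnError = true
"""

updated = set_module_key(sample, "ManageEC2User", "RandomizePassword", "false")
print(updated)
```

The same helper could flip FatalOnError instead, e.g. `set_module_key(text, "ManageEC2User", "FatalOnError", "false")`, with the text read from /usr/local/aws/ec2-macos-init/init.toml and written back afterwards.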

mattcataws commented 2 weeks ago

Glad to hear you found a workaround. That seems like a better way to ensure the module doesn't try to change the password, since you've already set it.

Running ec2-macos-init clean is our recommended method if you want to provide a clean slate for ec2-macos-init in a derivative AMI. The --all flag tells the clean command to also remove all instance history (code link) which will cause ec2-macos-init to think it's running for the first time whenever you launch an instance with that AMI.

If this is your desired outcome, I would recommend updating the init.toml file to tune any of the other modules with this first-run behavior in mind for your AMI. This might not be necessary if the ManageEC2User module was the only one causing problems for you.

Hopefully this helps answer your questions, but please do let me know if there's anything else.

aws / ec2-macos-init

Password randomization failure blocks ssh key install #50