Open OliverKoo opened 3 years ago
Can you see what sysadminctl -secureTokenStatus ec2-user says?
(I'll go into a little more detail later...)
Edits:
I don't recall seeing this specific problem before, but we have had at least a few people turn up with nix+macOS+VM issues over the past few months. Nothing solid has come out of these reports yet, in part because everyone works around their immediate problem and forgets or loses interest in debugging it.
My general hunch for a while has been that some cloud providers are doing something ~weird (i.e., that desktop users don't do) when they set up their VMs, and that this is what causes the trouble.
I tried Nix out in a VM several weeks back while I was suffering through trying to debug an unrelated issue that required repeatedly updating macOS. While it was a complete pain, I didn't run into any of the issues described so far. But, in the process of researching that, I did stumble on this: https://mrmacintosh.com/securetoken-documentation/
From that description, it's at least plausible that they're setting up accounts without SecureToken, and that this causes the trouble. But, we still need to validate that thesis, find a fix, and figure out if it's practical for us to fix it automagically on install or if it's the sort of thing we'll just have to test for and complain about.
Since there's nothing solid to point to, I'll collect some links to existing discussions about this:
For completeness, here are IRC logs covering this troubleshooting attempt:
however, when I reboot, the nix vol didn't auto mount (maybe /etc/fstab is no longer used by Catalina?) and /nix is now owned by root
fstab still works fine in Catalina (and in Big Sur). Your /nix is probably owned by root because nothing is mounted there. AFAIK, any mount point described by /etc/synthetic.conf will be owned by root until/unless some other user successfully mounts something over it.
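To tell "nothing is mounted at /nix" apart from "something is mounted but owned by the wrong user", a small check along these lines may help. This is only a sketch: the helper name is_mounted_at is made up for illustration, and it assumes mount prints lines of the form "<device> on <path> (<options>)", as it does on macOS.

```shell
# Sketch: report whether anything is mounted at a given path by scanning
# `mount` output read from stdin. is_mounted_at is a hypothetical helper.
is_mounted_at() {
  # $1 = mount point to look for, e.g. /nix
  awk -v mp="$1" '
    index($0, " on " mp " (") > 0 { found = 1 }
    END { exit !found }   # exit status 0 only if a matching line was seen
  '
}

# Usage on the affected machine (not executed here):
#   mount | is_mounted_at /nix && echo "mounted" || echo "not mounted"
#   stat -f '%Su' /nix    # BSD stat: prints the user that owns /nix
```

If the first command says "not mounted" and the owner is root, that matches the explanation above: the root-owned directory is just the empty mount point.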
@abathur thanks for getting back to me so quickly. this is what I get
ec2-user@ip-10-249-9-18 ~ % sysadminctl -secureTokenStatus ec2-user
2021-03-16 04:19:36.035 sysadminctl[2351:21074] Secure token is DISABLED for user ec2-user
I am not super familiar with nix, nor with macOS in general. I am using it now for my new job, but I am happy to test drive or debug whatever you need and then document the process.
Correction to my initial report: when I said reboot, I actually meant starting a brand new mac1.metal instance from the AMI (Amazon Machine Image) created from the first machine. Basically, I am using Packer to provision nix onto a machine that is later baked into an AMI.
Do you have a "password" for that account? With the caveat that I don't really know what we're doing here, I think you can enable this with something like sysadminctl -secureTokenOn ec2-user -password interactive, at which point it should prompt you for it.
(I'm basing this on sysadminctl --help, but the usage isn't terribly clear.)
If sysadminctl -secureTokenStatus confirms that it is enabled afterwards, I'm curious if the store will mount on reboot without any further changes.
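For scripting this check (for example in an image build), something like the following could extract the state from the status line. It is a rough sketch: secure_token_state is a hypothetical helper, it only matches the "Secure token is ENABLED/DISABLED" wording seen in this thread, and the 2>&1 in the usage line is there on the assumption that sysadminctl writes its status message to stderr.

```shell
# Sketch: reduce sysadminctl -secureTokenStatus output (read from stdin)
# to a single word. secure_token_state is a hypothetical helper name.
secure_token_state() {
  awk '
    /Secure token is ENABLED/  { state = "ENABLED" }
    /Secure token is DISABLED/ { state = "DISABLED" }
    END { print (state == "" ? "UNKNOWN" : state) }
  '
}

# Usage on the affected machine (not executed here):
#   sysadminctl -secureTokenStatus ec2-user 2>&1 | secure_token_state
```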
So, after enabling the secure token (run as root):
sh-3.2# sudo sysadminctl -secureTokenOn ec2-user -password - -adminUser ec2-user -adminPassword -
Enter password for ec2-user :
Enter password for ec2-user :
2021-03-16 16:00:02.676 sysadminctl[42542:216568] - Done!
sh-3.2# sysadminctl interactive -secureTokenStatus ec2-user
2021-03-16 16:00:18.307 sysadminctl[42590:216908] Secure token is ENABLED for user ec2-user
After a reboot, the nix volume is attached. (Not sure if this will work with the AMI process, though, i.e. creating an image that has the secure token already generated.)
Now I am seeing a different issue on the build machine.
dyld: Library not loaded: /nix/store/i1cg0wfns9j4lmfzvx5dz6rc436vs6ms-libsodium-1.0.18/lib/libsodium.23.dylib
Referenced from: /Users/ec2-user/.nix-profile/bin/nix
Reason: no suitable image found. Did find:
file system sandbox blocked open() of '/nix/store/i1cg0wfns9j4lmfzvx5dz6rc436vs6ms-libsodium-1.0.18/lib/libsodium.23.dylib'
/nix/store/i1cg0wfns9j4lmfzvx5dz6rc436vs6ms-libsodium-1.0.18/lib/libsodium.23.dylib: stat() failed with errno=1
file system sandbox blocked open() of '/nix/store/i1cg0wfns9j4lmfzvx5dz6rc436vs6ms-libsodium-1.0.18/lib/libsodium.23.dylib'
/bin/bash: line 1: 1188 Abort trap: 6 nix --version
I am running the build agent as a daemon via a plist in /Library/LaunchDaemons. The build agent gets launched when my mac1.metal EC2 machine boots.
If I log in, though, and run the build agent locally, then nix --version works as expected.
(tried this solution setting sandbox and extra path in my nix.conf)
Progress! This latter issue is at least something others have reported.
I'm curious about the build agent and what you're using it for here? There's also a --daemon install of Nix (which will likely become the only supported install after #4289) that'll run as root and use nixbld users for builds, but maybe you've already ruled it out in your situation?
If you know you need a distinct build agent, I'm curious how your launchdaemon differs from the one a daemon install would use, and whether those differences matter here:
https://github.com/NixOS/nix/blob/master/misc/launchd/org.nixos.nix-daemon.plist.in
If you want something less async, we can also talk in #nix-darwin on IRC
I am using buildkite, so the buildkite-agent is the one invoking nix. I basically set all the nix env vars in the plist so the agent shell would have the nix context (the equivalent of doing . /Users/buildkite-agent/.nix-profile/etc/profile.d/nix.sh).
(I have been using ec2-user in our conversation thus far in an attempt to simplify the discussion, since ec2-user is the default user, but it seems like the nix issue I am seeing is not system wide. Something is funny with nix when I run the buildkite-agent daemon.) You can replace all instances of ec2-user above with buildkite-agent. I installed and ran nix as buildkite-agent.
this is the plist of buildkite-agent
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<!--
A launchd config for loading buildkite-agent on system boot on OS X
systems, and runs in GUI mode (which allows Xcode UI testing but requires
the user to login)
-->
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.buildkite.buildkite-agent</string>
<key>ProgramArguments</key>
<array>
<string>/usr/local/bin/buildkite-agent</string>
<string>start</string>
</array>
<key>KeepAlive</key>
<dict>
<key>SuccessfulExit</key>
<false/>
</dict>
<key>RunAtLoad</key>
<true/>
<key>ProcessType</key>
<string>Interactive</string>
<key>UserName</key>
<string>buildkite-agent</string>
<key>ThrottleInterval</key>
<integer>30</integer>
<key>StandardOutPath</key>
<string>/usr/local/var/log/buildkite-agent.log</string>
<key>StandardErrorPath</key>
<string>/usr/local/var/log/buildkite-agent.error.log</string>
<key>EnvironmentVariables</key>
<dict>
<key>PATH</key>
<string>/Users/buildkite-agent/.nix-profile/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin</string>
<key>HOME</key>
<string>/Users/buildkite-agent</string>
<key>BUILDKITE_AGENT_CONFIG</key>
<string>/usr/local/etc/buildkite-agent/buildkite-agent.cfg</string>
<key>USER</key>
<string>buildkite-agent</string>
<key>AWS_REGION</key>
<string>us-east-1</string>
<key>NIX_PATH</key>
<string>/Users/buildkite-agent/.nix-defexpr/channels</string>
<key>NIX_PROFILES</key>
<string>/nix/var/nix/profiles/default /Users/buildkite-agent/.nix-profile</string>
<key>NIX_SSL_CERT_FILE</key>
<string>/Users/buildkite-agent/.nix-profile/etc/ssl/certs/ca-bundle.crt</string>
</dict>
</dict>
</plist>
If I launch the daemon with sudo launchctl load -w /Library/LaunchDaemons/com.buildkite.buildkite-agent.plist
then I see the "dyld: Library not loaded" error posted above when building,
but I can successfully build with nix if I invoke the agent directly after ssh-ing into the machine:
sudo su - buildkite-agent
then /usr/local/bin/buildkite-agent start
I will also give the multi user install a shot and report back
If you want something less async, we can also talk in #nix-darwin on IRC
that would be great, can you give me direction or link to the chat?
If you already have an IRC client, you can find us on freenode. If you don't, I gather you can use the webchat via https://webchat.freenode.net/#nix-darwin
After setting BUILDKITE_SHELL to use /bin/sh, the dyld error went away.
Now seeing:
$ trap 'kill -- $$' INT TERM QUIT; nix --version
| /bin/sh: trap 'kill -- $$' INT TERM QUIT; nix --version: No such file or directory
| 🚨 Error: The command exited with status 127
After setting /bin/sh as BUILDKITE_SHELL, the nix env vars are somehow not set (they are still in the plist).
For my personal notes, a summary of yesterday's discussion:
Nix needs /bin/sh to have Full Disk Access. People seem to get around this by using a GUI session to add a security exemption (literally VNC into the desktop, open System Preferences > Security & Privacy > Privacy > Full Disk Access, unlock, and then add /bin/sh) (example). The BK agent also seems to use /bin/sh internally.
The nix vol not auto mounting after boot seems to be fixed by generating a secure token for the buildkite-agent user and then rebooting. However, I also found that running
sudo mount_apfs disk2s6 /nix
sudo diskutil enableOwnership /nix
sudo chown -R buildkite-agent /nix
and then rebooting also fixes the issue without a secure token.
buildkite's bootstrap shell behaves differently when run by launchctl than when run locally by a user. The nix context does not seem to be fully inherited in the daemon shell.
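One way to pin down what "not fully inherited" means would be to dump env from both contexts (say, env > /tmp/login.env in the working ssh session, and the same from a build step run by the launchd-started agent) and compare the variable names. The sketch below assumes those two dump files exist; missing_env_names and the file paths are made up for illustration, and multi-line variable values would confuse it.

```shell
# Sketch: list variable names present in the first env dump but absent
# from the second. missing_env_names is a hypothetical helper name.
missing_env_names() {
  # $1 = env dump from the working login shell
  # $2 = env dump from the launchd-started agent
  _a=$(mktemp)
  _b=$(mktemp)
  cut -d= -f1 "$1" | sort -u > "$_a"   # keep only the NAME part of NAME=value
  cut -d= -f1 "$2" | sort -u > "$_b"
  comm -23 "$_a" "$_b"                 # names only in the first dump
  rm -f "$_a" "$_b"
}

# Usage (paths are placeholders):
#   missing_env_names /tmp/login.env /tmp/daemon.env
```

This only covers environment differences; it would not reveal sandbox or Full Disk Access differences between the two contexts.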
- The nix vol not auto mounting after boot seems to be fixed by generating a secure token for the buildkite-agent user and then rebooting. However, I also found that running
sudo mount_apfs disk2s6 /nix
sudo diskutil enableOwnership /nix
sudo chown -R buildkite-agent /nix
and then rebooting also fixes the issue without a secure token.
Interesting. I do have some comments in my installer PR about enableOwnership.
Have you re-tried with my hosted installer, or is this still with the official one? I'm curious what the nix volume line in /etc/fstab says.
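To pull out just that line, a small filter like this should do. It is a sketch that assumes the usual whitespace-separated fstab layout with the mount point in the second field; fstab_entry_for is a hypothetical helper name.

```shell
# Sketch: print the non-comment fstab entries whose mount point (second
# field) equals the given path. fstab_entry_for is a hypothetical helper.
fstab_entry_for() {
  # $1 = mount point, $2 = fstab file (defaults to /etc/fstab)
  awk -v mp="$1" '$1 !~ /^#/ && $2 == mp' "${2:-/etc/fstab}"
}

# Usage on the affected machine (not executed here):
#   fstab_entry_for /nix
```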
Will also ping you on IRC.
Summary of yesterday's IRC chat, for documentation purposes:
Installed the --daemon install (#4289) onto the AMI and enabled the buildkite-agent secure token. Instances booted from this AMI behave as follows: the nix volume is still unmounted, and the buildkite-agent secure token is disabled again. After ssh-ing into the machine, enabling the token, and rebooting, the nix vol mounts correctly.
nix seems to need FDA (@abathur is that right?). buildkite-agent run directly from a local login session via ssh seems to have some permission that buildkite-agent run from the launchd daemon bootstrap shell lacks.
- nix seems to need FDA (@abathur is that right?). buildkite-agent run directly from a local login session via ssh seems to have some permission that buildkite-agent run from the launchd daemon bootstrap shell lacks.
I'm not sure whether Nix does or doesn't in this context. My understanding is that a few people have worked around issues like this by adding an FDA exemption for /bin/sh (because the launchdaemon for nix-daemon uses /bin/sh).
The last comment in the other thread asked about whether you'd added the FDA exemption for buildkite-agent. I'm not sure if that is a documented expectation on their end or not. The exemption should propagate to some degree, so I'd try that one first.
I had some thoughts late yesterday about removing/replacing the remaining homedir references from your launchd plist. I'm curious if you did try that (don't feel obliged, mainly wondering if we should follow up on that possibility).
I had some thoughts late yesterday about removing/replacing the remaining homedir references from your launchd plist. I'm curious if you did try that (don't feel obliged, mainly wondering if we should follow up on that possibility).
do you mean this or something else? I am willing to test drive
current state of the plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<!--
A launchd config for loading buildkite-agent on system boot on OS X
systems, and runs in GUI mode (which allows Xcode UI testing but requires
the user to login)
-->
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.buildkite.buildkite-agent</string>
<key>ProgramArguments</key>
<array>
<string>/usr/local/bin/buildkite-agent</string>
<string>start</string>
</array>
<key>KeepAlive</key>
<dict>
<key>SuccessfulExit</key>
<false/>
</dict>
<key>RunAtLoad</key>
<true/>
<key>ProcessType</key>
<string>Interactive</string>
<key>UserName</key>
<string>buildkite-agent</string>
<key>ThrottleInterval</key>
<integer>30</integer>
<key>StandardOutPath</key>
<string>/usr/local/var/log/buildkite-agent.log</string>
<key>StandardErrorPath</key>
<string>/usr/local/var/log/buildkite-agent.error.log</string>
<key>EnvironmentVariables</key>
<dict>
<key>PATH</key>
<string>/nix/var/nix/profiles/default/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin</string>
<key>BUILDKITE_AGENT_CONFIG</key>
<string>/usr/local/etc/buildkite-agent/buildkite-agent.cfg</string>
<key>USER</key>
<string>buildkite-agent</string>
<key>AWS_REGION</key>
<string>us-east-1</string>
<key>NIX_PROFILES</key>
<string>/nix/var/nix/profiles/default</string>
<key>NIX_SSL_CERT_FILE</key>
<string>/nix/var/nix/profiles/default/etc/ssl/certs/ca-bundle.crt</string>
</dict>
</dict>
</plist>
@klardotsh I think we've exhausted our current ideas for getting this to work without the full-disk-access security exemption--do you remember what hoops you needed to jump through to get a VNC session?
Here are instructions for getting a VNC session to mac1.metal instances: https://gist.github.com/sebsto/6af5bf3acaf25c00dd938c3bbe722cc1
Those instructions look roughly like what I followed (which was https://www.lets-talk-about.tech/2020/12/aws-create-macos-desktop.html), so they should work. I've sadly been juggling a lot of things so haven't dived too far into the Nix-on-Mac rabbit hole lately (a coworker got Nix working on his Mac with far fewer issues than on my EC2 instance, so it fell down the priority list a bit)
I marked this as stale due to inactivity.
I also ran into build issues due to my /nix being owned by root. I'm not sure if it was initially owned by my user, because I didn't check when I installed it.
I am on an M1 MacBook Pro (not a VM).
I will try to reinstall and see if the owner is still root.
Describe the bug
On an AWS Mac EC2 instance running Catalina 10.15.7, I installed nix with the recommended approach
sh <(curl -L https://nixos.org/nix/install) --darwin-use-unencrypted-nix-store-volume
works great. You can see /nix is mounted
and /nix is owned by ec2-user
Problem: however, when I reboot, the nix vol didn't auto mount (maybe /etc/fstab is no longer used by Catalina?) and /nix is now owned by root
I can get around it by
sudo mount_apfs disk2s6 /nix
but I am using these Mac EC2 instances for CI purposes and the process would fail due to
/Users/ec2-user/.nix-profile/etc/profile.d/nix.sh: Operation not permitted
Steps To Reproduce
described above
Expected behavior
The nix vol is mounted at boot and /nix is owned by the user who executed the install script.
nix-env --version output: nix (Nix) 2.3.10
Additional context
I am running these on aws ec2 Mac1.metal instances