New versions of agent require re-authorisation of macOS system permissions

nick-f commented 1 year ago

(I originally posted https://github.com/buildkite/homebrew-buildkite/issues/26, but the more I think about it, the more it looks like an agent issue rather than a Homebrew-specific one.)

Background

Every time a new version of buildkite-agent is installed on macOS, it appears as a different app and needs to have permissions (like accessing a network volume) re-approved.

The re-authorisation process cannot* be automated as it is a macOS security feature.

(* Well, there are a couple of workarounds. One is enrolling and managing a device using MDM, which allows for profiles to be added to authorise permissions. The other is to disable System Integrity Protection (SIP) which is not ideal either.)

This involves screen sharing to each Mac and clicking a button that shows up when the first test runs. macOS stores this authorisation in the TCC database, which is read-only unless SIP is disabled.

To see the entries in the TCC database, run:

sqlite3 ~/Library/Application\ Support/com.apple.TCC/TCC.db "select * from access;"

A good explainer of TCC can be found here: https://www.rainforestqa.com/blog/macos-tcc-db-deep-dive.
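To narrow that output down to one program, the rows can be piped through a small filter. This is only a sketch: `tcc_rows_for_client` is a hypothetical helper, and it assumes the service and the client are the first two columns of sqlite3's default pipe-separated output, which is what the sample output later in this issue suggests.

```shell
#!/bin/bash

# Print "service<TAB>client" for every TCC row whose client contains $1.
# Rows are read from stdin in sqlite3's default pipe-separated format.
# Assumption: column 1 is the service and column 2 is the client.
tcc_rows_for_client() {
  local needle="$1"
  awk -F'|' -v needle="$needle" 'index($2, needle) > 0 { print $1 "\t" $2 }'
}

# Usage on a Mac (not run here):
#   sqlite3 ~/Library/Application\ Support/com.apple.TCC/TCC.db "select * from access;" \
#     | tcc_rows_for_client buildkite-agent
```

Seeing more than one buildkite-agent row for the same service, each with a different versioned path, would match the behaviour described in this issue.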

This is occurring on a bunch of Mac Minis running macOS 12.6.2 (21G320), which is the most up to date public version of macOS.

Steps to reproduce

1. Install a version of buildkite-agent (such as 3.38.0).
2. Run a Buildkite job that connects to a network volume from the host.
3. Click the approval dialog that shows up to grant buildkite-agent access to the network volume or other directory requiring permissions. (See https://github.com/buildkite/agent/issues/1922#issuecomment-1396313047 for a minimal pipeline and script to replicate.)
4. See that the entry is listed in the TCC database (note that the path to buildkite-agent is a full path including the version number, not the symlink that would come from running something like which buildkite-agent):

sqlite3 ~/Library/Application\ Support/com.apple.TCC/TCC.db "select * from access;"

# Truncated sample output. Other apps may also appear here if you've granted them permission to, say, use the camera or microphone
kTCCServiceSystemPolicyNetworkVolumes|/opt/homebrew/Cellar/buildkite-agent/3.38.0/bin/buildkite-agent|1|2|2|1||||UNUSED||0|1666063609

5. Install a new version of buildkite-agent and repeat step 2.

Expected behaviour

buildkite-agent will have access to previously approved permissions.

Actual behaviour

Each version is seen as a completely new program and needs to have permissions granted again.

moskyb commented 1 year ago

sorry for taking so long to get back to you on this one @nick-f! if it's any consolation, we've been ruminating on it a bit in the background.

🤔 i wonder if this is a buildkite agent problem or a brew problem (or, more likely, a "the way we ship the buildkite agent on brew" problem).

My (uninformed, i'm extremely not a mac developer) hypothesis is that if the agent binary we shipped through brew was in a stable location across updates (ie, if we stored the agent in /opt/homebrew/Cellar/buildkite-agent/bin/buildkite-agent rather than /opt/homebrew/Cellar/buildkite-agent/3.38.0/bin/buildkite-agent), then theoretically, the path that stores the service in the TCC database would remain stable, and macOS wouldn't have to re-prompt for permissions whenever an update rolls around.

i'm having trouble reproducing this locally, however, mostly because i don't have any network volumes to connect 😅. Would you be able to send through a minimal reproducing pipeline yaml so i can look into this a bit further?

nick-f commented 1 year ago

No problem at all @moskyb 😃

I've also been thinking about it and came up with a similar thought this morning to what you suggested about the static path. My thought was basically going to be copying the buildkite-agent to a static directory and updating our LaunchAgents to run the agent from there instead of the homebrew-installed location. But that only fixes it for us, not for everyone.
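A minimal sketch of that copy step, for anyone who wants to try it. `install_stable_copy` is a hypothetical helper and the destination directory is an assumption; nothing here is a buildkite-agent or Homebrew feature, and the sketch doesn't verify that macOS keeps the approval for the copied binary.

```shell
#!/bin/bash

# Copy an agent binary into a fixed directory so the path a LaunchAgent
# points at stays the same across upgrades.
#   $1: path to the installed binary (may be a symlink)
#   $2: directory that should hold the stable copy
install_stable_copy() {
  local source_binary="$1" stable_dir="$2"
  mkdir -p "$stable_dir"
  # Without -R, cp copies the file a symlink points to, not the link itself.
  cp "$source_binary" "$stable_dir/buildkite-agent"
  chmod +x "$stable_dir/buildkite-agent"
}

# Usage after each upgrade (not run here; the destination is an assumption):
#   install_stable_copy "$(which buildkite-agent)" "$HOME/.buildkite-agent/bin"
```

The LaunchAgent would then need its ProgramArguments pointed at the copied binary rather than the Homebrew path.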

Good news is that I've come up with some easier to reproduce replication steps too. macOS also prompts for access to the Desktop, Documents, and Downloads, so they can be used in place of Network mounts.

Pipeline:

steps:
  - label: ':file_cabinet: macOS permissions'
    command: .buildkite/steps/test_mounted_network_drive_access.sh
    timeout_in_minutes: 2
    agents:
      queue: 'your-awesome-queue-name'

and then test_mounted_network_drive_access.sh:

#!/bin/bash

set -euo pipefail

mounted_directory="$HOME/Desktop"

echo "--- Checking access to $mounted_directory"

ls -al "$mounted_directory"

Once that runs, a permission prompt shows on the host.

Let me know if you need anything else 🙏

moskyb commented 1 year ago

hey @nick-f! thanks for the further detail, i really appreciate it.

i suspect that this might not be an issue with the buildkite agent or its brew config, but with Brew itself :/ (though i'd be very happy for this to be disproved)

the symlinking from a stable place to the versioned location is actually already what brew does:

$ brew install buildkite-agent
# ... 
$ which buildkite-agent
/opt/homebrew/bin/buildkite-agent
$ ls -la $(which buildkite-agent)
lrwxr-xr-x 52 ben  7 Mar 09:46 /opt/homebrew/bin/buildkite-agent -> ../Cellar/buildkite-agent/3.44.0/bin/buildkite-agent

... which implies that macOS automatically dereferences symlinks when figuring out TCC stuff, as it's the versioned path that shows up in the TCC database. This is very annoying for our purposes.
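To see the path that a symlink like this ends up pointing at, the chain can be followed by hand. The function below is a sketch (`resolve_path` is a made-up name) that only uses `readlink` without flags plus `cd -P`; it shows what the link resolves to and makes no claim about how TCC does its own lookup.

```shell
#!/bin/bash

# Follow a chain of symlinks and print the physical path of the final target.
resolve_path() {
  local target="$1" dir link
  while [ -L "$target" ]; do
    # Physical directory that contains the link itself.
    dir="$(cd -P "$(dirname "$target")" && pwd)"
    link="$(readlink "$target")"
    case "$link" in
      /*) target="$link" ;;        # absolute link target
      *)  target="$dir/$link" ;;   # relative link target, resolved against its directory
    esac
  done
  # Normalise the directory part so "../" segments disappear.
  dir="$(cd -P "$(dirname "$target")" && pwd)"
  printf '%s/%s\n' "$dir" "$(basename "$target")"
}

# Usage (not run here):
#   resolve_path "$(which buildkite-agent)"
```

With the Homebrew layout shown above, this should print the versioned Cellar path rather than /opt/homebrew/bin/buildkite-agent.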

i'd be very interested to know if/how other mac devs deal with this issue - as mentioned above, it's not my particular area of expertise, and if there are any mac people lurking, i'd love to see your reckons :)

i think your solution of "install it, then move it to a stable location" is probably going to be the easiest solve in this case, though i agree that it doesn't solve it for everyone.

one interesting thing to note is that using the latest buildkite-agent from homebrew (v3.44.0 as above), i can't actually reproduce this issue using the steps above - there's no entry for buildkite agent in my TCC DB at all, and the step runs just fine, outputting the contents of my desktop.

my hypothesis is that because i'm running it from iterm2, which i've given full disk perms, that permissioning gets passed down to child processes? it might be an alternative avenue to explore for not having to reprompt all the time.

nick-f commented 1 year ago

I've been working on this (and trying to understand the issue more) away from GitHub but I'm bringing the conversation back here now.

@moskyb Yeah, running buildkite-agent in iTerm means that it, and anything it launches, inherits iTerm's permissions. If a LaunchAgent is created and loaded to run buildkite-agent instead, then buildkite-agent itself should show up in System Settings.

Save the plist below to ~/Library/LaunchAgents/buildkite-agent.plist and load it by running launchctl load -w /Users/<your username>/Library/LaunchAgents/buildkite-agent.plist

Also update the paths everywhere /Users/nick-f/ is to be your username, not mine 😄

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>buildkite-agent</string>

  <key>WorkingDirectory</key>
  <string>/Users/nick-f/.buildkite-agent/bin</string>

  <key>ProgramArguments</key>
  <array>
    <string>/Users/nick-f/.buildkite-agent/bin/buildkite-agent</string>
    <string>start</string>
    <string>--config</string>
    <string>/Users/nick-f/.buildkite-agent/buildkite-agent.cfg</string>
  </array>

  <key>EnvironmentVariables</key>
  <dict>
    <key>PATH</key>
    <string>/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin</string>
  </dict>

  <key>RunAtLoad</key>
  <true/>

  <key>KeepAlive</key>
  <dict>
    <key>SuccessfulExit</key>
    <false/>
  </dict>

  <key>ProcessType</key>
  <string>Interactive</string>

  <key>ThrottleInterval</key>
  <integer>30</integer>

  <key>StandardOutPath</key>
  <string>/Users/nick-f/.buildkite-agent/buildkite-agent.log</string>

  <key>StandardErrorPath</key>
  <string>/Users/nick-f/.buildkite-agent/buildkite-agent.err.log</string>
</dict>
</plist>
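The username substitution can be scripted rather than done by hand. This is a rough sketch: `personalise_plist` is a hypothetical helper that swaps the /Users/nick-f/ prefix used in the plist above for another home directory, and it assumes that prefix is the only user-specific part of the file.

```shell
#!/bin/bash

# Rewrite the example home directory in a plist read from stdin.
#   $1: home directory to substitute (defaults to $HOME)
personalise_plist() {
  local home_dir="${1:-$HOME}"
  # ${home_dir%/} drops a trailing slash so the result never contains "//".
  sed "s|/Users/nick-f/|${home_dir%/}/|g"
}

# Usage (not run here; template.plist is a hypothetical copy of the plist above):
#   personalise_plist < template.plist > ~/Library/LaunchAgents/buildkite-agent.plist
```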

As far as actually resolving the issue, it can be fixed by having the Buildkite agent be code signed.

I've written a bash script to help simplify this process. https://github.com/nick-f/macos-binary-signer/blob/main/buildkite-agent/run.bash

If anyone is going to use this, be sure to read the repo README and the /buildkite-agent README for important notes. There are some shenanigans that have to happen, like putting the signed binary in a .pkg and distributing the .pkg to hosts.

I've tested that the code signing works with newer versions of the agent (like 3.46.0 which just came out) and the permissions are retained because the binary has the same bundle ID and is signed by the same account. 🎉
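One way to check this on two versions is to compare what `codesign` reports for each binary. The sketch below assumes the `Identifier=` and `TeamIdentifier=` lines printed by `codesign -dvv` correspond to the "bundle ID" and "account" mentioned above; `signing_identity` is a hypothetical helper, not part of the linked script.

```shell
#!/bin/bash

# Read `codesign -dvv` output on stdin and print "identifier team".
signing_identity() {
  awk -F'=' '
    $1 == "Identifier"     { id = $2 }
    $1 == "TeamIdentifier" { team = $2 }
    END { print id, team }
  '
}

# Usage on a Mac (codesign writes these details to stderr, hence 2>&1):
#   codesign -dvv /path/to/buildkite-agent 2>&1 | signing_identity
```

If the old and new binaries print the same pair, they carry the same signing identity; whether that alone is what TCC matches on is not something this check proves.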

Ideally this would be done as part of the Buildkite distribution process, so we don't have to sign the package ourselves. Because it's being signed by my certificate I won't distribute the package but the script provided above should help anyone else facing the same problem.

Some great resources that helped me when trying to work this out:

rfay commented 4 months ago

This is a major problem every time buildkite gets updated with brew upgrade. Tests block waiting for many permission dialogs.
