buildkite / agent-stack-k8s

Spin up an autoscaling stack of Buildkite Agents on Kubernetes
MIT License

Can't use non-root user with podSpecPatch #326

Open bpoland opened 2 months ago

bpoland commented 2 months ago

Hi, I'm trying to use the new functionality introduced with https://github.com/buildkite/agent-stack-k8s/pull/282 but running into some issues with checkout. I was initially seeing the checkout container error at startup with this message logged:

mkdir: can't create directory '//.ssh': Permission denied

I compared the resulting pod specs from before and after, and noticed that this logic used to be triggered to create a non-root user; it looks like that no longer happens when using podSpecPatch.

I tried manually adding the root securityContext in my podSpecPatch, but that causes other permission errors after checkout, since the build tries to run as a non-root user later. It seems like the logic linked above needs to be triggered even when using podSpecPatch?
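For reference, the workaround I tried was shaped roughly like this (values are illustrative, not my exact config):

```yaml
# Forcing root via podSpecPatch gets past the mkdir failure in checkout,
# but later steps that expect a non-root user then hit permission errors
# on the checked-out files.
podSpecPatch:
  securityContext:
    runAsUser: 0
```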

moskyb commented 2 months ago

yo! @DrJosh9000 is gonna take a look at this for you, will get back to you soon.

DrJosh9000 commented 2 months ago

I'm not sure I understand exactly how this is falling down, but I'll post my debugging so far anyway. Quick recap - the scheduler does this:

  1. Initialise a podSpec using the one from the kubernetes "plugin", or a default that runs command
  2. ...a variety of unrelated(?) nonsense...
  3. Adds on the other containers, including checkout if enabled...
    • ...which inspects the securityContext in the podSpec so far. Based on whether runAsUser and runAsGroup are nonzero, it runs the extra logic to create the user/group in the checkout container before checking out
  4. ...more unrelated nonsense (the init container)...
  5. Apply podSpecPatch from config (helm values), then from the "plugin"

So, if the securityContext is supplied by podSpecPatch and not by podSpec, it won't be seen by createCheckoutContainer, so that logic won't be run.
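The ordering problem can be sketched like this (hypothetical, simplified types, not the actual scheduler code):

```go
package main

import "fmt"

// Simplified stand-ins for the scheduler's data model.
type SecurityContext struct {
	RunAsUser, RunAsGroup int64
}

type PodSpec struct {
	SecurityContext *SecurityContext
}

// needsUserSetup mimics the check in createCheckoutContainer: it decides
// whether to emit the user/group-creation logic based only on the
// securityContext visible at the time it runs.
func needsUserSetup(spec *PodSpec) bool {
	sc := spec.SecurityContext
	return sc != nil && (sc.RunAsUser != 0 || sc.RunAsGroup != 0)
}

func main() {
	spec := &PodSpec{} // step 1: initial podSpec, no securityContext

	// step 3: checkout container is built *now*, so the check sees nothing
	fmt.Println("user setup at schedule time:", needsUserSetup(spec))

	// step 5: podSpecPatch is applied *afterwards*, too late to matter
	spec.SecurityContext = &SecurityContext{RunAsUser: 2000, RunAsGroup: 2000}
	fmt.Println("securityContext visible after patch:", needsUserSetup(spec))
}
```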

Here's a pipeline that ran for me:

agents:
  queue: kubernetes
steps:
  - label: ":pipeline:"
    plugins:
      - kubernetes:
          podSpec:
            securityContext:
              runAsUser: 2000
              runAsGroup: 2000
            containers:
              - name: "command"
                image: "alpine:latest"
                command: ["ash -c"]
                args: ["'whoami ; id ; ls -l'"]
          podSpecPatch:
            containers:
              - name: "checkout"

which worked:

~~~ Preparing working directory
...snip...
~~~ Running commands 
$ ash -c 'whoami ; id ; ls -l'
whoami: unknown uid 2000
uid=2000 gid=2000 groups=2000
total 4
-rw-r--r--    1 2000     2000            32 May 22 06:01 README.md

But I guess that's not ideal - you have to go back to specifying a bunch of things in podSpec, bringing back a bunch of the pipeline duplication.

My off-the-top-of-my-head solution would be: instead of conditioning the user/group creation logic on what the security context is at schedule time, the checkout container could compare current and intended UID/GID at run time. (Perhaps the command container should be doing the same thing.)
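In sketch form, that run-time check could look something like this (names and the intended UID are illustrative, not an actual implementation):

```go
package main

import (
	"fmt"
	"os"
)

// needsUserSetup compares the UID the container is actually running as
// against the intended one, so the decision happens at run time rather
// than at schedule time. An intended UID of 0 means "stay as root".
func needsUserSetup(currentUID, intendedUID int) bool {
	return intendedUID != 0 && currentUID != intendedUID
}

func main() {
	const intendedUID = 2000 // would come from the pod's securityContext
	if needsUserSetup(os.Getuid(), intendedUID) {
		fmt.Println("UID mismatch: create user/group before checking out")
	} else {
		fmt.Println("already running as the intended user")
	}
}
```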

bpoland commented 2 months ago

Thanks for the context. I will see if I can get it working on my side like you described. As you said, not ideal but at least I could eliminate some of the other duplication. Just to clarify though, is the podSpecPatch containers entry for checkout needed in order for the securityContext to be applied to the checkout container?

The solution you outlined does make sense to me, but I guess that would be a longer term thing eh?

DrJosh9000 commented 2 months ago

I was originally testing to see if the strategic merge patch used by podSpecPatch removed the securityContext, before realising that it was an ordering problem. So that part is probably not needed. 🤞

As for timeline, I think it diminishes the usefulness of PodSpecPatch to have to work around it like that, so I hope to take a stab at a solution sooner rather than later.