buildkite / agent-stack-k8s

Spin up an autoscaling stack of Buildkite Agents on Kubernetes
MIT License

"failure" job uses private image causing imagePullBackoff #273

Closed bpoland closed 1 month ago

bpoland commented 5 months ago

Hey, I'm running into a bug consistently but it's pretty weird :)

In my podspec I am using an imagePullSecrets section since I'm using a private image. It looks something like this (I've removed some extra sections that don't seem relevant, but let me know if you can't reproduce and I can share more details):

  - label: Build
    plugins:
      - kubernetes:
          gitEnvFrom:
            - secretRef:
                name: buildkite-agent-ssh
          podSpec:
            imagePullSecrets:
              - name: image-pull-secret
            containers:
              - image: <my-private-image>
                command:
                  - |
                    echo "+++ build"
                    <my build command>
            volumes:
              - name: agent-hooks
                configMap:
                  name: buildkite-agent-hooks
                  defaultMode: 0755
          extraVolumeMounts:
            - name: agent-hooks
              mountPath: /buildkite/hooks

This worked fine and created a pod with the correct imagePullSecrets. But, we use a yaml linter that told me to replace the defaultMode: 0755 with defaultMode: "0755" and when I did that, the pod created by agent-stack-k8s was now missing the imagePullSecrets and it failed to pull the image :(

I don't understand why adding quotes would have broken this but hopefully you can reproduce and fix it! :)

bpoland commented 5 months ago

And just to be clear, adding the quotes causes the resulting pod to be completely missing the imagePullSecrets block...

triarius commented 5 months ago

Thanks for raising this @bpoland,

I think the linter is giving you some suspicious advice. The defaultMode field of a config map projection in a volume is supposed to have an int32 type, so 0755 should be the value rather than "0755", which is a string. Maybe the linter is hypercorrecting for the fact that 0755 is octal notation, i.e. 493 in decimal. But the k8s docs (and POSIX conventions) explicitly allow file permissions to be specified as octals, so I think it's preferable to leave it as 0755.

I think YAML 1.2 supports 0o755 as a more explicit octal notation; maybe your linter won't complain about that? Hopefully the layers of parsing that the pipeline YAML goes through allow for that octal notation too. If not, we can look into fixing it.
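For reference, here's a sketch of the notations discussed above, using the defaultMode field from the pipeline in this issue (the commented-out lines show the alternatives; only the quoted form breaks):

```yaml
volumes:
  - name: agent-hooks
    configMap:
      name: buildkite-agent-hooks
      defaultMode: 0755      # YAML 1.1 octal; parses to the int32 493
      # defaultMode: 0o755   # YAML 1.2 explicit octal; same value
      # defaultMode: 493     # decimal equivalent; linter-safe
      # defaultMode: "0755"  # a string: fails to unmarshal into int32
```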

As for the imagePullSecrets being swallowed: that could be a bug. What I expect to happen when part of the podSpec has an invalid type is that the job does not run at all; instead, a different job that just echoes a warning and fails is run. This is done to surface errors in the Buildkite UI that would otherwise only be present in the controller's k8s logs. The pod in that job will NOT have the imagePullSecrets, or anything else fancy you specified in the pipeline YAML. Is it possible that that's what you're seeing?

bpoland commented 5 months ago

Ahh thanks @triarius you are exactly right. I see in the pod it's trying to start that this is the command it wants to run:

echo "failed parsing Kubernetes plugin: json: cannot unmarshal string
        into Go struct field ConfigMapVolumeSource.PodSpec.volumes.configMap.defaultMode
        of type int32" && exit 1

And all the other config is missing too, not just the image pull secrets. But I guess in this case because it can't pull the image, the result is just the job hanging in the Buildkite UI and not displaying that message haha

I tried the 0o755 method and that works but yamllint is still not happy with it, I can figure that part out though.
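For anyone hitting the same yamllint complaint: its octal-values rule is configurable, so a sketch of a .yamllint config that tolerates the explicit 0o755 form while still flagging implicit octals might look like this (option names are from yamllint's octal-values rule; verify against your yamllint version):

```yaml
# .yamllint (sketch)
extends: default
rules:
  octal-values:
    forbid-implicit-octal: true   # still flag 0755
    forbid-explicit-octal: false  # allow 0o755
```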

Maybe allowing the imagePullSecrets to be defined in the controller helm chart when specifying a custom image would solve the problem? Then that message would be displayed in the Buildkite UI as intended :)

triarius commented 5 months ago

Yeah there are a few issues that prevent this error message container from starting. This and #233 are known. I have a plan to address them both in a single change.

triarius commented 5 months ago

Ok so I've decided NOT to copy over the image pull secrets to this "failure job" in #275. Here's the rationale I put in the PR description:

... transferring things like gitEnvFrom and imagePullSecret [is not] going to work if there was an error in the specification of those fields. We want this failure job to be as bare bones as possible to reduce the chance of errors in the Pipeline YAML from preventing it from running.

The only container image needed for the Pod of the "failure job" to run is the buildkite-agent image specified in the controller config. The default value for this is ghcr.io/buildkite/agent-stack-k8s/agent:<agent-stack-k8s-version>, which is public, so it shouldn't need the image pull secrets.
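For context, the agent image in question is the one set in the controller's Helm values; a hedged sketch of where a custom (private) image would be configured (the config.image key and the example registry are assumptions for illustration):

```yaml
# agent-stack-k8s Helm values (sketch; key names assumed)
config:
  # custom, private agent image -- requires a pull secret
  image: my-registry.example.com/my-agent:latest
  # if unset, the default is the public
  # ghcr.io/buildkite/agent-stack-k8s/agent:<agent-stack-k8s-version>
```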

bpoland commented 5 months ago

The only container image needed for the Pod of the "failure job" to run is the buildkite-agent image specified in the controller config. The default value for this is ghcr.io/buildkite/agent-stack-k8s/agent:<agent-stack-k8s-version>, which is public, so it shouldn't need the image pull secrets.

So are you saying that even if we've specified a custom agent image, the error would use the standard/default agent image so it wouldn't need a pull secret? That would make sense to me, although I don't see anything related to that in the PR (my Go is a little rusty though haha)

I understand about wanting to handle the case where the imagePullSecrets are not defined properly in the pipeline, but I think if you allowed specifying that in the helm chart when providing a custom agent image then you could ensure that it is not malformed?

triarius commented 5 months ago

So are you saying that even if we've specified a custom agent image, the error would use the standard/default agent image so it wouldn't need a pull secret? That would make sense to me, although I don't see anything related to that in the PR (my Go is a little rusty though haha)

Good catch, it's not in the diff because it was already doing that :)

I understand about wanting to handle the case where the imagePullSecrets are not defined properly in the pipeline, but I think if you allowed specifying that in the helm chart when providing a custom agent image then you could ensure that it is not malformed?

Once https://github.com/buildkite/agent-stack-k8s/pull/262 is merged you will be able to do that and more!

I want the behaviour to be that controller-level patches are always applied, so that if they are incorrect, all jobs are affected. I expect they will mainly be tweaked when setting up the stack, and the person doing that would be able to debug them with kubectl.
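Once that lands, a controller-level patch could supply the pull secret for every job. A hedged sketch of what the Helm values might look like (the pod-spec-patch key reflects the mechanism proposed in #262; the exact naming may differ in the released version):

```yaml
# values.yaml sketch (assumes the controller-level patch from #262)
config:
  pod-spec-patch:
    imagePullSecrets:
      - name: image-pull-secret
```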

bpoland commented 5 months ago

Good catch, it's not in the diff because it was already doing that :)

Interesting, in my case the pod was in image pull backoff because it was still trying to pull my custom agent image. So I don't think it was quite working right, but I'm fine to wait until some of these other changes are in. #262 definitely looks interesting!

triarius commented 5 months ago

Interesting, in my case the pod was in image pull backoff because it was still trying to pull my custom agent image. So I don't think it was quite working right, but I'm fine to wait until some of these other changes are in. #262 definitely looks interesting!

Ah, ok, I got mixed up between various levels of "default". So you're setting the custom image in a private repo as a value for the helm chart. Yeah, this will still try to pull that and encounter image pull backoff. I see now why it's a problem.

I think you may have intended to nudge me to use the "default value of the default image". I'll do that in my PR.

bpoland commented 5 months ago

Yep you got it. That sounds great, thanks!

DrJosh9000 commented 5 months ago

I believe with #282 merged and released, this is resolved?

bpoland commented 5 months ago

Thanks, yes specifically I think this commit fixes it :)

https://github.com/buildkite/agent-stack-k8s/commit/945540c8faefd04e2d7c570e3ed358d85e8a00fd

bpoland commented 3 months ago

👋 unfortunately this issue seems to have come back -- seeing today that an invalid podspec causes agent-stack-k8s (version 0.11.0) to try creating a pod with the custom agent image we are specifying as a helm value, and not the publicly-available default image. That fails to start because our custom image requires authentication to pull and no image pull secrets get specified, so it enters image pull backoff :(

DrJosh9000 commented 3 months ago

Hi @bpoland! Not sure what's broken here. The one thing I can think of that might have impacted it was a change to the default agent image in #304 from ghcr.io/buildkite/agent-stack-k8s/agent:latest to ghcr.io/buildkite/agent:<specific version>.

bpoland commented 2 months ago

👋 hey just an update, we are still seeing this issue. Anything I can do to help narrow down the cause?

DrJosh9000 commented 1 month ago

Hey, sorry for the radio silence. I have a plan for eliminating the need for running a whole pod just to report failure to BK. Watch this space 📺