feat: allow pod spec to be defined and patched in agent and plugin

42atomys commented 6 months ago

Description

This pull request introduces a new feature where the PodSpec can be defined and patched dynamically, allowing for more customized configurations of the Kubernetes pods that run the Buildkite agents.

This includes the ability to specify any pod specifitations directly through the Buildkite agent's configuration or inside the plugins with the logic of strategic merge logic used in kubectl.

Execution Logic

PodSpec are resolved without changes
The agent resolve the pod-spec-patch in agent configuration file and apply it on all jobs executed by him.
The agent resolve kubernetes plugin pod-spec-path field defined in pipeline step and apply it above.

Example

This agent configuration change the service account name, automount the service account and add a env var without erase env var already present.

# ...
# ssh-credentials-secret are the secret with the ssh credentials
# configured on the agent with docker-ssh-config
# @see https://github.com/buildkite/docker-ssh-env-config
ssh-credentials-secret: buildkite-ssh-github-creds
pod-spec-patch:
  serviceAccountName: 'buildkite-agent-sa'
  automountServiceAccountToken: true
  containers:
    - name: container-0
      env:
        - name: GITHUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: github-secrets
              key: github-token

Related Issues

Resolve #247

42atomys commented 6 months ago

reviewers : Let me know if you have question, I need help on the tests part 🙏

42atomys commented 5 months ago

Hi @triarius, thank you for your time. I have an example of usage that is implemented internally on a large scale, across thousands of jobs per day. We deploy one agent per cluster using a helm chart, with this configuration (we've been using the fork for 3 weeks now) to bind the service account and define resources at the agent level with our default values.

Note: Not defining resources in a Kubernetes pod results in a low priority assignment and scheduling of all jobs on the same node, causing latency and job crashes when many jobs are running concurrently.

Agent configuration (helm chart):

    image: ghcr.io/42atomys/helm/agent-stack-k8s/controller:0.8.0
    config:
      tags: [ queue="default" ]
      ssh-credentials-secret: github-credentials
      pod-spec-patch:
        serviceAccountName: buildkite-service-account
        automountServiceAccountToken: true
          initContainers:
            - name: copy-agent
              resources:
                requests:
                  cpu: 100m
                  memory: 96Mi
                limits:
                  memory: 96Mi
          containers:
            - name: container-0
               resources:
                 requests:
                   cpu: 200m
                   memory: 256Mi
                 limits:
                   memory: 256Mi
            - name: dind
              resources:
                requests:
                  cpu: 100m
                  memory: 336Mi
                limits:
                  memory: 336Mi
            - name: checkout
              resources:
                requests:
                  cpu: 100m
                  memory: 480Mi
                limits:
                  memory: 480Mi
            - name: agent
              resources:
                requests:
                  cpu: 50m
                  memory: 64Mi
                limits:
                  memory: 64Mi

In the step definition, we sometimes need "more power" to execute tasks, like integration tests with a browserless engine. With pod-spec-patch at the step level, we can granularly configure our pipeline without having to redefine all the pod/container definitions. See the example below:

Simple CI Pipeline:

 steps:
  # Execute a step on a Kubernetes pod to run rspec with the BUILDKITE_ANALYTICS_TOKEN
  # secret injected into container-0. 
  # NOTE: We define the commands in the `commands` field, we dont declare
  # any container in the `podSpecPatch` field, so the plugin will use the default
  # container (container-0) to run the commands.
  - agents:
      queue: default
    commands:
    - bundle exec rspec
    plugins:
    - kubernetes:
        podSpecPatch:
          containers:
          - env:
            - name: BUILDKITE_ANALYTICS_TOKEN
              valueFrom:
                secretKeyRef:
                  key: BUILDKITE_ANALYTICS_TOKEN
                  name: buildkite
            name: container-0

  # Execute a playwritght test on a Kubernetes pod to run playwright with more
  # resources due to the high memory and CPU usage.
  - agents:
      queue: default
    commands:
    - bundle exec playwright test
    plugins:
    - kubernetes:
        podSpecPatch:
          containers:
          - name: container-0
            resources:
              limits:
                cpu: 2
                memory: 4Gi
              requests:
                cpu: 2
                memory: 4Gi

This is a simple example of how podSpecPatch can be used at the step level to adjust resources, but any field of the pod can be changed as well, without the need to define the entire pod spec each time. We are currently scaling Buildkite to an industrial level on my organisation

triarius commented 5 months ago

Thanks for the context @42atomys. I feel like with a PodSpecPatch at the step level, if you're only running one container in a job, you will almost never need to specify PodSpec at the step level. I'm happy to leave them both in for now, but we may decide to deprecate PodSpec later. The controller will always create a base PodSpec, but you can patch it in the plugin.

I've got the tests working on another branch. Are you happy for me to push to this branch and merge this PR? I'll be making it independent of #248 though. We can discuss the merits of step level sshCredentialSecrets there.

42atomys commented 5 months ago

@triarius Thanks for your return, I think deprecate PodSpec in the future must be done and rework the logic about job creation !

FYI: I dont know if you got it, I create a default container-0 when no container are defined to run commands fields of step definition without the need to define the kubernetes plugin 🎉 The idea behind are to be seamless with the normal behaviour without the kubernetes plugin.

You can merge your branch and merge it for me all are good 💯

buildkite / agent-stack-k8s