actions / runner-container-hooks

Runner Container Hooks for GitHub Actions
MIT License

feat: add functionality to provide optional pod template #50

Closed nielstenboom closed 11 months ago

nielstenboom commented 1 year ago

Hi there!

This PR contains changes to support a new env variable ACTIONS_RUNNER_POD_TEMPLATE_PATH, which you can use to point to a template pod spec. This template can be used to enrich the created pod with all the fields that are required. We personally use it to set a bunch of default env variables on every job pod and to mount cache volumes into it, but it can also be used to set securityContext or serviceAccountName, for example.

I've chosen to solve this using a template file because I think this will require the least amount of changes to implement upstream in https://github.com/actions/actions-runner-controller. Otherwise you would have to deal with passing the values of every field (serviceAccount, resources, mounts, etc.) separately.

It uses the lodash mergeWith function to merge the template with the pod spec. It concatenates lists, so you can add extra env variables or volumeMounts without overwriting the ones added by this project.
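To illustrate the merge behaviour described above, here is a self-contained sketch (the actual PR uses lodash's mergeWith with a customizer; this hand-rolled helper just mimics the array-concatenation semantics, so names and structure here are illustrative, not the hook's real code):

```typescript
// Deep-merge `source` into `target`, concatenating arrays instead of
// overwriting them, so template-provided env vars and volumeMounts are
// appended to the ones the hook generates rather than replacing them.
function mergeConcat(target: Record<string, any>, source: Record<string, any>): Record<string, any> {
  for (const key of Object.keys(source)) {
    const t = target[key]
    const s = source[key]
    if (Array.isArray(t) && Array.isArray(s)) {
      target[key] = t.concat(s)          // lists: concatenate
    } else if (t && s && typeof t === 'object' && typeof s === 'object') {
      mergeConcat(t, s)                  // nested objects: recurse
    } else {
      target[key] = s                    // scalars: source value wins
    }
  }
  return target
}

// Example: the hook's generated env plus a template env var.
const hookEnv = { env: [{ name: 'GITHUB_ACTIONS', value: 'true' }] }
const templateEnv = { env: [{ name: 'POETRY_CACHE_DIR', value: '/ci-cache/poetry' }] }
mergeConcat(hookEnv, templateEnv)
console.log(hookEnv.env.map(e => e.name))  // → [ 'GITHUB_ACTIONS', 'POETRY_CACHE_DIR' ]
```

Note that which side wins for scalar fields (e.g. why the template's image/name/command are ignored) depends on the merge order the hook actually uses.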

Please let me know what you think! I'd be happy to provide more info or write more tests if you think this is required. Thanks!

Fixes https://github.com/actions/runner-container-hooks/issues/46 & https://github.com/actions/runner-container-hooks/issues/33

nikola-jokic commented 1 year ago

Hey @nielstenboom,

Thank you for submitting this PR! We are currently looking at how we can use workflow inputs to help with both of these issues. I will leave this PR open as a reference, but we may take a different direction here :relaxed:

nielstenboom commented 1 year ago

> Hey @nielstenboom,
>
> Thank you for submitting this PR! We are currently looking at how we can use workflow inputs to help with both of these issues. I will leave this PR open as a reference, but we may take a different direction here ☺️

No worries. Internally we'll temporarily keep running this fork until you've made the changes. Looking forward to the fix! 😄

MichaelHudgins commented 1 year ago

> Hey @nielstenboom,
>
> Thank you for submitting this PR! We are currently looking at how we can use workflow inputs to help with both of these issues. I will leave this PR open as a reference, but we may take a different direction here ☺️

Any update on a path forward on this? My use case mainly surrounds being able to set resources on the workflow pod. Right now I don't see an easy way to set a CPU request or bind a GPU to the pod.

nikola-jokic commented 1 year ago

Hey @MichaelHudgins,

No updates yet, but it is on our backlog and will definitely be worked on in the near future.

sdarwin commented 1 year ago

Hi, while experimenting with https://github.com/actions/actions-runner-controller, I fairly quickly ran into a roadblock: the job pods have insufficient memory, and all the jobs crash. So how do you set resources on the workflow job pods? The GitHub issue discussing it leads back to this pull request. This looks like an important bug to solve.

nielstenboom commented 1 year ago

We run the fork internally as follows:

  1. Create a template.yaml file with the fields you'd like to set. We use the following template, but you can also set the resources or any other field.

    apiVersion: v1
    kind: Pod
    metadata:
      name: thisvaluewillbeignored
    spec:
      containers:
        # the runner will override the "image", "name" and "command" fields
        - image: "test/test"
          name: "thisvaluewillbeignored"
          command:
            - "these"
            - "are"
            - "overridden"
          env:
            - name: POETRY_CACHE_DIR
              value: "/ci-cache/poetry"
          volumeMounts:
            - name: ci-cache
              mountPath: /ci-cache
      volumes:
        - name: ci-cache
          persistentVolumeClaim:
            claimName: ci-cache
  2. Build your own runners with the following Dockerfile:
FROM summerwind/actions-runner:v2.299.1-ubuntu-20.04

ARG RUNNER_CONTAINER_HOOKS_VERSION=0.3.5

RUN cd "$RUNNER_ASSETS_DIR" \
    && sudo rm -rf ./k8s && pwd \
    && curl -fLo runner-container-hooks.zip https://github.com/nielstenboom/runner-container-hooks/releases/download/v${RUNNER_CONTAINER_HOOKS_VERSION}/actions-runner-hooks-k8s-${RUNNER_CONTAINER_HOOKS_VERSION}.zip \
    && unzip ./runner-container-hooks.zip -d ./k8s \
    && rm -f runner-container-hooks.zip

COPY template.yaml /home/runner/template.yaml
ENV ACTIONS_RUNNER_POD_TEMPLATE_PATH="/home/runner/template.yaml"

USER runner

ENTRYPOINT ["/bin/bash", "-c"]
CMD ["entrypoint.sh"]
  3. Set the image.actionsRunnerRepositoryAndTag to your internally built image tag in the Helm chart. If all is well, your new pods should have the fields set from the template.

Good luck!

sdarwin commented 1 year ago

Thanks Niels. I temporarily got around the cpu/mem resource issue another way, by setting a LimitRange on the namespace. The next issue is that GKE Autopilot is very conservative about the size and quantity of nodes it launches: the minimum. When a job pod requesting a lot of cpu/mem is provisioned, what happens? What should happen is that the cluster autoscaler adds more nodes. What actually happens is a sudden "OutofCPU" error, and the job fails.

nielstenboom commented 1 year ago

> Thanks Niels. I temporarily got around the cpu/mem resource issue another way, by setting a LimitRange on the namespace. The next issue is GKE Autopilot is very conservative about the size and quantity of nodes it launches. The minimum. When a job pod with a lot of cpu/mem is provisioned, what happens? What should happen is that the HPA autoscaler adds more nodes. However, what actually happens is a sudden "OutofCPU" error, and the job fails.

Maybe in your case it could already work if you set the resources on the runner pods themselves, since the job pods are forced onto the exact same node (see code here)?

sdarwin commented 1 year ago

@nielstenboom very interesting! Rather than distract too much in this pull request, I created a new discussion thread about the outofcpu errors.

jaimehrubiks commented 1 year ago

Added some related discussion here: https://github.com/actions/actions-runner-controller/discussions/2592

If this doesn't move quickly, I'm planning to write a mutating webhook admission controller that would intercept the "-workload" pods, read the env variables (because there is no way to set labels), and reconfigure the pod spec dynamically before the pod is actually created. So for example, this GitHub Actions job definition:

jobs:
  # This workflow contains a single job called "build"
  build:
    container:
      image: node:14.16
      env:
        NODE_ENV: development
        RUNNER_NODE_SELECTORS: "node.kubernetes.io/instance-type: g4dn.xlarge"
        RUNNER_TOLERATIONS: "[{ key: cccis.com/karpenter, operator: Exists, effect: NoSchedule}]"
        RUNNER_LABELS: "owner: jaime, type: job"
        RUNNER_CPU_REQUESTS: "1"
        RUNNER_GPUS: "1"

Would mutate the pod and inject fields like:

    nodeSelector:
      node.kubernetes.io/instance-type: g4dn.xlarge
    tolerations:
      - key: cccis.com/karpenter
        operator: Exists
        effect: NoSchedule
    resources:
      requests:
        cpu: 1
        memory: 1Gi
        nvidia.com/gpu: 1
      limits:
        memory: 2Gi
        nvidia.com/gpu: 1

And so on for the others.

I'm not sure I'll have time to do that, but it should not be complicated, and it might be the best option for our use case because it is the simplest.
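The mutation logic proposed above could be sketched roughly as follows. This is a hypothetical illustration, not existing code: the env-var names (RUNNER_NODE_SELECTORS, RUNNER_CPU_REQUESTS) are the conventions the comment proposes, and a real admission controller would also need an HTTP server, TLS, and a JSON-patch response, all omitted here:

```typescript
interface EnvVar { name: string; value: string }
interface Container { name: string; env?: EnvVar[]; resources?: any }
interface PodSpec { nodeSelector?: Record<string, string>; containers: Container[] }

// Look up one of the proposed RUNNER_* env vars on a container.
function getEnv(c: Container, name: string): string | undefined {
  return c.env?.find(e => e.name === name)?.value
}

// Mutate the intercepted "-workload" pod spec in place based on the
// env-var conventions described in the comment above.
function mutatePod(spec: PodSpec): PodSpec {
  const job = spec.containers[0]
  if (!job) return spec

  // RUNNER_NODE_SELECTORS: "key: value" -> spec.nodeSelector entry
  const sel = getEnv(job, 'RUNNER_NODE_SELECTORS')
  if (sel) {
    const i = sel.indexOf(':')
    spec.nodeSelector = {
      ...(spec.nodeSelector ?? {}),
      [sel.slice(0, i).trim()]: sel.slice(i + 1).trim(),
    }
  }

  // RUNNER_CPU_REQUESTS: "1" -> container resources.requests.cpu
  const cpu = getEnv(job, 'RUNNER_CPU_REQUESTS')
  if (cpu) {
    job.resources = {
      ...(job.resources ?? {}),
      requests: { ...(job.resources?.requests ?? {}), cpu },
    }
  }
  return spec
}
```

In a real webhook the same transformation would be emitted as a JSON patch in the AdmissionReview response rather than mutating the object directly.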

nikola-jokic commented 1 year ago

Hey @jaimehrubiks,

The work is mostly done. You can refer to this PR. Basically, we will extend the yaml job definition for you to specify the pod spec in your workflow yaml file. We just need to finalize the interface (the way you will provide a definition in a workflow). But if that implementation does not work for you, you can choose to implement it however you would like :relaxed:. The intention of the hook was to allow you to customize behaviour.

jaimehrubiks commented 1 year ago


I'll wait until the functionality is added, because my approach cannot work: there is nothing in the pod template apart from the image. I thought the env variables would be there, but they are read from files shared with the main runner pod via a volume elsewhere; they are not in the pod template itself.

Nevermind, it is actually possible if you specify the variable in this specific location:

jobs.<job_id>.container.env

It won't be added to the pod spec if set in these other three locations:

env
jobs.<job_id>.env
jobs.<job_id>.steps[*].env

(And of course "services", since those don't work on Kubernetes right now.)
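Concretely, the placement distinction above looks like this in a workflow file (variable names here are illustrative):

```yaml
jobs:
  build:
    env:
      WORKFLOW_ONLY: "1"        # NOT written into the job pod's spec
    container:
      image: node:14.16
      env:
        ON_POD_SPEC: "1"        # jobs.<job_id>.container.env: this one
                                # does appear in the pod's container env
    steps:
      - run: echo hello
```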

nielstenboom commented 11 months ago

Closing this one in favor of https://github.com/actions/runner-container-hooks/pull/75

isker commented 5 months ago

> We just need to finalize the interface (the way you will provide a definition in a workflow)

Hi @nikola-jokic, as far as I can tell there is no workflow support for this yet. Is there any issue or PR I can watch to follow this? I'd like to set requests/limits per-job in the kind of way that this issue proposed.