actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

Cannot pass nodeSelector, tolerations and resources in containerMode: kubernetes #1730

Open aacecandev opened 2 years ago

aacecandev commented 2 years ago

Controller Version

0.25.2

Helm Chart Version

0.20.2

CertManager Version

1.9.1

Deployment Method

Helm

cert-manager installation

Cert-manager is installed using helmfile

helm/
  cert-manager/
    values.yaml
helmfile.yaml

Contents of values.yaml

installCRDs: true

Contents of helmfile.yaml

helmDefaults:
  createNamespace: true
  atomic: true
  verify: false
  wait: true
  timeout: 1200
  recreatePods: true
  disableValidation: true

repositories:
  - name: github
    url: https://actions-runner-controller.github.io/actions-runner-controller
  - name: "incubator"
    url: "https://charts.helm.sh/incubator"
  - name: jetstack
    url: https://charts.jetstack.io

templates:
  default: &default
    namespace: kube-system
    missingFileHandler: Warn
    values:
    - helm/{{`{{ .Release.Name }}`}}/values.yaml
    secrets:
    - helm/{{`{{ .Release.Name }}`}}/secrets.yaml

releases:
  - name: cert-manager
    <<: *default
    namespace: cert-manager
    chart: jetstack/cert-manager
    version: v1.9.1

Then install it executing

helmfile apply
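
The controller chart itself can be installed from the same helmfile. A rough sketch of the corresponding release entry (the namespace and values-file layout here are assumptions; the chart repository is the "github" entry defined above and the version matches the Helm Chart Version reported):

  - name: actions-runner-controller
    <<: *default
    namespace: actions-runner-system   # assumed namespace
    chart: github/actions-runner-controller
    version: 0.20.2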

Checks

Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: runner-gpu-k8s
spec:
  replicas: 1
  template:
    spec:
      image: summerwind/actions-runner:latest
      nodeSelector:
        cloud.google.com/gke-gpu-partition-size: 1g.5gb
      tolerations:
        - effect: "NoSchedule"
          key: "konstellation.io/gpu"
          operator: "Equal"
          value: "true"
        - effect: "NoSchedule"
          key: "nvidia.com/gpu"
          operator: "Equal"
          value: "present"
      serviceAccountName: "gh-runner-service-account"
      labels:
        - self-hosted
        - gpu
        - k8s
      repository: <organization/repository>
      containerMode: kubernetes
      dockerdWithinRunnerContainer: false
      dockerEnabled: false
      workVolumeClaimTemplate:
        storageClassName: "standard"
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 5Gi
      resources:
        limits:
          nvidia.com/gpu: 1
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: runner-autoscaler-gpu-k8s
spec:
  minReplicas: 0
  maxReplicas: 1
  scaleTargetRef:
    name: runner-gpu-k8s
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      amount: 1
      duration: "5m"

To Reproduce

1. Deploy Role, RoleBinding, ServiceAccount (a sketch of the RBAC is shown after this list)
2. Deploy Controller
3. Deploy RunnerDeployment
4. Deploy HorizontalRunnerAutoscaler
5. Launch a GitHub Actions workflow using a GPU base Docker image
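
For step 1, the RBAC for containerMode: kubernetes looks roughly like this. The Role/RoleBinding names are illustrative, the ServiceAccount name matches the RunnerDeployment above, and the rules follow the permissions the k8s container hooks are documented to need; treat it as a sketch:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: gh-runner-service-account
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gh-runner-role          # illustrative name
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "create", "delete"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["get", "create"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "create", "delete"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gh-runner-role-binding  # illustrative name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: gh-runner-role
subjects:
  - kind: ServiceAccount
    name: gh-runner-service-account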

Example workflow

name: GitHub Actions Example GPU - K8S

on: [push, workflow_dispatch]

jobs:
  Explore-GitHub-Actions:
    runs-on: ["self-hosted", "gpu", "k8s"]
    container: 
      image: nvidia/cuda:11.0.3-base-ubi7
    steps:
      - name: Check out repository code
        uses: actions/checkout@v3
      - run: sleep 300 # For debugging purposes, the actual behavior expected is defined below
      - run: ls -ltR /usr/local/nvidia
      - run: /usr/local/nvidia/bin/nvidia-smi -L
      - run: /usr/local/nvidia/bin/nvidia-smi

Describe the bug

Once everything has been deployed, a runner pod is created on the GPU node. This pod has the correct nodeSelector, tolerations and resources.

After a while, the workflow is launched in a new pod, but this pod doesn't contain any of the above fields in its manifest, so I don't have GPU resources, binaries, etc. mounted in the pod.

Describe the expected behavior

It is expected that the pod running the actual workflow inherits these fields, or can be configured so that they are specified on it, allowing me to schedule the pod on a GPU-enabled node and to set the resources so that the GKE NVIDIA device plugin can read the limits and pass the GPU through to the workflow pod.
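
Concretely, the workflow pod is expected to end up with roughly the same scheduling-related fields as the runner pod (values copied from the RunnerDeployment above):

# pod-level scheduling fields
nodeSelector:
  cloud.google.com/gke-gpu-partition-size: 1g.5gb
tolerations:
  - key: konstellation.io/gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
  - key: nvidia.com/gpu
    operator: Equal
    value: present
    effect: NoSchedule
# container-level limits, so the NVIDIA device plugin exposes the GPU to the job container
resources:
  limits:
    nvidia.com/gpu: 1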

Controller Logs

2022-08-18T12:18:19Z    DEBUG   actions-runner-controller.horizontalrunnerautoscaler    Calculated desired replicas of 1        {"horizontalrunnerautoscaler": "github-gpu-k8s/runner-autoscaler-gpu-k8s", "suggested": 0, "reserved": 1, "min": 0, "max": 1}
2022-08-18T12:18:19Z    DEBUG   controller-runtime.webhook.webhooks     received request        {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment", "UID": "7871fd26-9df4-4605-87e2-f02f7734d954", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerDeployment", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerdeployments"}}
2022-08-18T12:18:19Z    DEBUG   controller-runtime.webhook.webhooks     wrote response  {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment", "code": 200, "reason": "", "UID": "7871fd26-9df4-4605-87e2-f02f7734d954", "allowed": true}
2022-08-18T12:18:19Z    DEBUG   controller-runtime.webhook.webhooks     received request        {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerdeployment", "UID": "b7df6cca-38ff-4d83-93df-3fb24c5d888f", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerDeployment", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerdeployments"}}
2022-08-18T12:18:19Z    INFO    runnerdeployment-resource       validate resource to be updated {"name": "runner-gpu-k8s"}
2022-08-18T12:18:19Z    DEBUG   controller-runtime.webhook.webhooks     wrote response  {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerdeployment", "code": 200, "reason": "", "UID": "b7df6cca-38ff-4d83-93df-3fb24c5d888f", "allowed": true}
2022-08-18T12:18:19Z    DEBUG   actions-runner-controller.horizontalrunnerautoscaler    Calculated desired replicas of 1        {"horizontalrunnerautoscaler": "github-gpu-k8s/runner-autoscaler-gpu-k8s", "suggested": 0, "reserved": 1, "min": 0, "max": 1, "last_scale_up_time": "2022-08-18 12:18:19 +0000 UTC", "scale_down_delay_until": "2022-08-18T12:28:19Z"}
2022-08-18T12:18:19Z    DEBUG   controller-runtime.webhook.webhooks     received request        {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerreplicaset", "UID": "935a04c9-b06f-44fd-9534-14480d8cb775", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerReplicaSet", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerreplicasets"}}
2022-08-18T12:18:19Z    DEBUG   controller-runtime.webhook.webhooks     wrote response  {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerreplicaset", "code": 200, "reason": "", "UID": "935a04c9-b06f-44fd-9534-14480d8cb775", "allowed": true}
2022-08-18T12:18:19Z    DEBUG   controller-runtime.webhook.webhooks     received request        {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment", "UID": "1043ad94-4bf1-4edf-b7ad-e28d0a698765", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerDeployment", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerdeployments"}}
2022-08-18T12:18:19Z    DEBUG   controller-runtime.webhook.webhooks     wrote response  {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment", "code": 200, "reason": "", "UID": "1043ad94-4bf1-4edf-b7ad-e28d0a698765", "allowed": true}
2022-08-18T12:18:19Z    DEBUG   controller-runtime.webhook.webhooks     received request        {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerdeployment", "UID": "7ed13cef-ac61-4ad9-b5d5-d30f15bfb7bc", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerDeployment", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerdeployments"}}
2022-08-18T12:18:19Z    INFO    runnerdeployment-resource       validate resource to be updated {"name": "runner-gpu-k8s"}
2022-08-18T12:18:19Z    DEBUG   controller-runtime.webhook.webhooks     wrote response  {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerdeployment", "code": 200, "reason": "", "UID": "7ed13cef-ac61-4ad9-b5d5-d30f15bfb7bc", "allowed": true}
2022-08-18T12:18:19Z    DEBUG   controller-runtime.webhook.webhooks     received request        {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerreplicaset", "UID": "7926c086-c8a9-47df-993f-7ded60a637a7", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerReplicaSet", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerreplicasets"}}
2022-08-18T12:18:19Z    INFO    runnerreplicaset-resource       validate resource to be updated {"name": "runner-gpu-k8s-vzpr9"}
2022-08-18T12:18:19Z    DEBUG   controller-runtime.webhook.webhooks     wrote response  {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerreplicaset", "code": 200, "reason": "", "UID": "7926c086-c8a9-47df-993f-7ded60a637a7", "allowed": true}
2022-08-18T12:18:19Z    DEBUG   actions-runner-controller.runnerdeployment      Updated runnerreplicaset due to spec change     {"runnerdeployment": "github-gpu-k8s/runner-gpu-k8s", "currentDesiredReplicas": 0, "newDesiredReplicas": 1, "currentEffectiveTime": "2022-08-18 12:18:19 +0000 UTC", "newEffectiveTime": "2022-08-18 12:18:19 +0000 UTC"}
2022-08-18T12:18:19Z    DEBUG   actions-runner-controller.runnerreplicaset      Skipped reconcilation because owner is not synced yet   {"runnerreplicaset": "github-gpu-k8s/runner-gpu-k8s-vzpr9", "owner": "github-gpu-k8s/runner-gpu-k8s-vzpr9-ctmc6", "pods": null}

Runner Pod Logs

2022-08-18 12:34:01.768  DEBUG --- Github endpoint URL https://github.com/
2022-08-18 12:34:02.462  DEBUG --- Passing --ephemeral to config.sh to enable the ephemeral runner.
2022-08-18 12:34:02.466  DEBUG --- Configuring the runner.

--------------------------------------------------------------------------------
|        ____ _ _   _   _       _          _        _   _                      |
|       / ___(_) |_| | | |_   _| |__      / \   ___| |_(_) ___  _ __  ___      |
|      | |  _| | __| |_| | | | | '_ \    / _ \ / __| __| |/ _ \| '_ \/ __|     |
|      | |_| | | |_|  _  | |_| | |_) |  / ___ \ (__| |_| | (_) | | | \__ \     |
|       \____|_|\__|_| |_|\__,_|_.__/  /_/   \_\___|\__|_|\___/|_| |_|___/     |
|                                                                              |
|                       Self-hosted runner registration                        |
|                                                                              |
--------------------------------------------------------------------------------

# Authentication

√ Connected to GitHub

# Runner Registration

√ Runner successfully added
√ Runner connection is good

# Runner settings

√ Settings Saved.

2022-08-18 12:34:07.485  DEBUG --- Runner successfully configured.
{
  "agentId": 159,
  "agentName": "runner-gpu-k8s-gtrgf-q87wv",
  "poolId": 1,
  "poolName": "Default",
  "ephemeral": true,
  "serverUrl": "https://pipelines.actions.githubusercontent.com/WbUFHxHNMTckMYpByHhGXjzC31kmXmT97NzsUmBl9gVn3Gj7rj",
  "gitHubUrl": "https://github.com/konstellation-io/arc-poc",
  "workFolder": "/runner/_work"
2022-08-18 12:34:07.490  NOTICE --- Docker wait check skipped. Either Docker is disabled or the wait is disabled, continuing with entrypoint
}
√ Connected to GitHub

Current runner version: '2.295.0'
2022-08-18 12:34:09Z: Listening for Jobs
2022-08-18 12:34:14Z: Running job: Explore-GitHub-Actions


Additional Context

https://github.com/actions-runner-controller/actions-runner-controller/pull/1546
https://cloud.google.com/kubernetes-engine/docs/how-to/gpus-multi
mumoshu commented 2 years ago

@aacecandev Hey! Are you saying that the job pod, which is created in addition to the runner pod when you're using the kubernetes container mode, is missing those fields? Can I take it that the runner pod has all the expected fields but the job pod does not?

alpiquero commented 2 years ago

Hi @mumoshu. I'm working with @aacecandev and can confirm that this is exactly what is happening.

blopezpi commented 2 years ago

> @aacecandev Hey! Are you saying that the job pod, which is created in addition to the runner pod when you're using the kubernetes container mode, is missing those fields? Can I take it that the runner pod has all the expected fields but the job pod does not?

Hi @mumoshu, the workflow pod doesn't have these values. The pod created by the RunnerDeployment runs with the fields configured as expected, but the workflow pod that the hook launches doesn't have the same fields. We need the workflow pod to run with these particular fields so it can be scheduled on a GPU node.

aacecandev commented 2 years ago

> @aacecandev Hey! Are you saying that the job pod, which is created in addition to the runner pod when you're using the kubernetes container mode, is missing those fields? Can I take it that the runner pod has all the expected fields but the job pod does not?

That's right @mumoshu, the above response describes exactly the buggy behavior. We think the problem could be in the runnerdeployment_controller not passing those fields through correctly, but we are not totally sure, since we've been busy rolling back from k8s 1.23 to 1.22 and trying to achieve pass-through of the GPU from GKE to DinD (which we've done successfully).

rdemorais commented 2 years ago

Looking forward to having nodeSelector on RunnerSet as well, so I can assign pods to my runner nodes.

sbalajisivaram commented 1 year ago

Hi @mumoshu, I recently started to use ARC and I'm noticing the same: I can't seem to use nodeSelector or tolerations in the RunnerDeployment. Btw, I'm using the sysbox runtime in the RunnerDeployment configuration, but even without sysbox I can't pass nodeSelector, as the admission controller is rejecting it.

ARC version - v0.25.2, k8s version - 1.22.15

Error:

Error from server: error when creating "runnerdeploy_new.yaml": admission webhook "mutate.runnerdeployment.actions.summerwind.dev" denied the request: json: cannot unmarshal bool into Go struct field RunnerSpec.spec.template.spec.nodeSelector of type string
sbalajisivaram commented 1 year ago

> Hi @mumoshu, I recently started to use ARC and I'm noticing the same: I can't seem to use nodeSelector or tolerations in the RunnerDeployment. Btw, I'm using the sysbox runtime in the RunnerDeployment configuration, but even without sysbox I can't pass nodeSelector, as the admission controller is rejecting it.
>
> ARC version - v0.25.2, k8s version - 1.22.15
>
> Error:
>
> Error from server: error when creating "runnerdeploy_new.yaml": admission webhook "mutate.runnerdeployment.actions.summerwind.dev" denied the request: json: cannot unmarshal bool into Go struct field RunnerSpec.spec.template.spec.nodeSelector of type string

Hi, I noticed why it was being rejected: I was passing an incorrect label. Please ignore.
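
For anyone hitting the same unmarshal error: nodeSelector is a map of strings, so any value that YAML would otherwise parse as a boolean or number has to be quoted (the label key here is just an example):

nodeSelector:
  cloud.google.com/gke-gpu-partition-size: "1g.5gb"
  example.com/gpu: "true"   # an unquoted true is unmarshalled as a bool and rejected by the webhook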

mumoshu commented 1 year ago

Hey everyone- sorry for the delayed response.

Workflow job pods are created by the runner container hooks, which are currently owned by GitHub.

ARC uses the hooks without any modifications. AFAIK, the pod spec of the workflow job pods is generated in https://github.com/actions/runner-container-hooks/blob/d988d965c57642f972246e28567301f6b4c054e1/packages/k8s/src/hooks/run-container-step.ts#L78-L110 within the "k8s" runner container hooks, which we embed into our runner images via https://github.com/actions-runner-controller/actions-runner-controller/blob/18077a1e83e346a5c3f3ae57ae9b8792ceb7c292/runner/actions-runner.dockerfile#L87-L90

That said, I guess the right way forward would be to file a feature request with the runner-container-hooks project too, so that we can collaborate on a potential solution. Maybe we can fork/modify the hooks to accept additional envvars or a config file to customize some pod and container fields of workflow job pods. Or maybe they can do it for us, which would be ideal, because then we wouldn't need to repeatedly rebase our fork onto their work.
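
In other words, the hook builds the job pod spec itself. Simplified, the *-workflow pod it creates today looks roughly like this (names are illustrative), with none of the runner pod's scheduling fields carried over:

apiVersion: v1
kind: Pod
metadata:
  name: runner-gpu-k8s-gtrgf-q87wv-workflow   # illustrative name
spec:
  containers:
    - name: $job                              # the job container created by the hook
      image: nvidia/cuda:11.0.3-base-ubi7     # the image from the workflow's container: key
  # no nodeSelector, tolerations or nvidia.com/gpu limits from the runner pod appear here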

carlturnerfs commented 1 year ago

Yes, I'm pretty sure the right solution is the one described above: making the workflow job pod configurable via the runner container hooks.

A couple of other important examples of fields you want to set are the security context and the service account for the job.

nielstenboom commented 1 year ago

Would like to mention volumeMounts as another example of a field you'd like to set!

nielstenboom commented 1 year ago

I've taken a crack at this problem here: https://github.com/actions/runner-container-hooks/pull/50

It solves it by providing functionality to pass a template file for the pod that gets created with containerMode: kubernetes. What do you guys think of this solution?

andre177 commented 1 year ago

I think I've got a workaround (not the prettiest solution, but at least it works): I simply patched the RunnerDeployment resource that manages the runner pod, adding tolerations and nodeSelector values. You can do this with kubectl patch or (in my case) with the kubectl_manifest Terraform resource:

resource "kubectl_manifest" "github_actions_runner_patch" {
  depends_on = [ helm_release.github_actions_runner ]
  override_namespace = "actions-runner-system"
  yaml_body          = <<YAML
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: github-agent-runner
spec:
  template:
    spec:
      nodeSelector:
        node: github-actions-runner
      tolerations:
        - key: github-actions-runner
          effect: NoSchedule
YAML
}
alpiquero commented 1 year ago

The issue here is that when the kubernetes mode is used, the values of nodeSelector, affinities, tolerations and containers[*].resources are not passed to the *-workflow pods that are created when a GitHub workflow is executed on a particular runner. Those specs are applied to the runner pods, but not to the pods that execute the workflow for those runners.

We use Kyverno to work around this. When a *-workflow pod is created, Kyverno mutates it to fit our needs.
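
Roughly, the policy looks like this (the policy name and the injected values are illustrative; container resources can be injected in the same mutate block):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: mutate-workflow-pods
spec:
  rules:
    - name: add-gpu-scheduling
      match:
        any:
          - resources:
              kinds: ["Pod"]
              names: ["*-workflow"]
      mutate:
        patchStrategicMerge:
          spec:
            nodeSelector:
              cloud.google.com/gke-gpu-partition-size: 1g.5gb
            tolerations:
              - key: nvidia.com/gpu
                operator: Equal
                value: present
                effect: NoSchedule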

timmjd commented 12 months ago

Looks like actions/runner-container-hooks#96 is going to solve this issue in the future.

jaimehrubiks commented 11 months ago

https://github.com/actions/runner-container-hooks/pull/75 solved the issue on the container-hooks side; now I guess this repository should implement its part.

findmyway commented 11 months ago

This feature is crucial to our application. Any chance to address it in the next release?

evandam commented 9 months ago

:+1: here - we need to use tolerations and nodeSelectors to target ARM nodes for faster multi-platform Docker images. Currently dind works, but ideally it would be great to use this approach without privileged containers.
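
For that use case the fields needed on the workflow pod would be along these lines (kubernetes.io/arch is the standard node label; the taint key is illustrative):

nodeSelector:
  kubernetes.io/arch: arm64
tolerations:
  - key: arch             # illustrative taint key
    operator: Equal
    value: arm64
    effect: NoSchedule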

nielstenboom commented 8 months ago

I've taken a stab at implementing this guys -> https://github.com/actions/actions-runner-controller/pull/3174

Would be very grateful for input from a maintainer! 🙌

nielstenboom commented 8 months ago

Actually, working on the implementation made me realize it’s already possible without any code changes to this project if you’re willing to jump through some configuration hoops:

1. Build and push your own runner image, with the newest 0.5.0 release of runner-container-hooks, to somewhere in your own infra:

FROM summerwind/actions-runner:v2.311.0-ubuntu-20.04
ARG RUNNER_CONTAINER_HOOKS_VERSION=0.5.0

RUN cd "$RUNNER_ASSETS_DIR" \
    && sudo rm -rf ./k8s && pwd \
    && curl -fLo runner-container-hooks.zip https://github.com/actions/runner-container-hooks/releases/download/v${RUNNER_CONTAINER_HOOKS_VERSION}/actions-runner-hooks-k8s-${RUNNER_CONTAINER_HOOKS_VERSION}.zip \
    && unzip ./runner-container-hooks.zip -d ./k8s \
    && rm -f runner-container-hooks.zip

USER runner

ENTRYPOINT ["/bin/bash", "-c"]
CMD ["entrypoint.sh"]

Then make sure you use this image for your runners (can be set in the Helm chart):

image:
  actionsRunnerRepositoryAndTag: "myrepo/runner:0.5.0"

2. Create a ConfigMap that holds your pod template (we use it to set a cache for CI); you will mount this ConfigMap as a file into your RunnerDeployment later:

apiVersion: v1
kind: ConfigMap
metadata:
  name: podtemplates
data:
  gpu.yaml: |
    spec:
      securityContext:
        runAsUser: 0
      containers:
        - name: $job # overwrites job container
          env:
          - name: POETRY_CACHE_DIR
            value: "/ci-cache/poetry"

          volumeMounts:
          - name: ci-cache
            mountPath: /ci-cache

      volumes:
      - name: ci-cache
        hostPath:
          path: /root
3. Create a RunnerDeployment with containerMode: "kubernetes" and point the ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE env var to your mounted template:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: runner-gpu
spec:
  replicas: 1
  template:
    spec:
      containerMode: "kubernetes"

      labels:
        - my-runners

      # manually add this env var that points to the file location
      # of your template
      env:
        - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
          value: "/templates/gpu.yaml"

      # we set the GPU resources on the runners and abuse the fact that
      # all GPUs become available to pods on the same node
      resources:
        limits:
          nvidia.com/gpu: 1

      # mount your configmap into your runner
      volumeMounts:
      - name: templates
        mountPath: /templates
      volumes:
      - name: templates
        configMap:
          name: podtemplates

Now run a CI job that uses this runner; you should see that the -workflow pod being created has all the fields you set in the template! 🎉
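
For the original GPU use case in this issue, the same mechanism should also let you put the scheduling fields on the workflow pod itself, assuming the hook version you build in merges these fields from the template (values copied from the RunnerDeployment at the top of the issue; treat this as a sketch):

# contents of the template file referenced by ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
spec:
  nodeSelector:
    cloud.google.com/gke-gpu-partition-size: 1g.5gb
  tolerations:
    - key: nvidia.com/gpu
      operator: Equal
      value: present
      effect: NoSchedule
  containers:
    - name: $job   # overwrites the job container, as in the cache example above
      resources:
        limits:
          nvidia.com/gpu: 1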

JustASquid commented 2 months ago

Any updates on this? I'm also interested in a similar application to @nielstenboom's.