actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

Cannot set resources Requests and Limits for workflow pods #3641

Open kanakaraju17 opened 4 months ago

kanakaraju17 commented 4 months ago

Controller Version

0.9.2

Deployment Method

Helm

To Reproduce

1. Deploy the gha-runner-scale-set-controller first with the below command.
   helm install arc . -f values.yaml -narc-systems

2. Deploy the gha-runner-scale-set with Kubernetes mode enabled.
   helm install arc-runner-set . -f values-kubernetes.yaml -narc-runners

Ideal scenario: the workflow pods that come up should have the requested resources and limits set.
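
Before triggering a workflow, it can help to confirm that both releases from the steps above are actually up (a quick sketch; it only assumes the namespaces used in the commands above):

# controller release and its pod
helm list -n arc-systems
kubectl get pods -n arc-systems

# runner scale set release and the minRunners idle runner pods
helm list -n arc-runners
kubectl get pods -n arc-runners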

Describe the bug

The workflow pods, whose names end with "-workflow", should have the specified CPU and memory resource requests and limits when they are created.

  ##       resources:
  ##         requests:
  ##           memory: "4Gi"
  ##           cpu: "2"
  ##         limits:
  ##           memory: "6Gi"
  ##           cpu: "4"  

Describe the expected behavior

The workflow pod created during pipeline execution should have the specified CPU and memory requests and limits set. However, it starts without them.

Additionally, an extra pod is created when the pipeline runs, alongside the existing runner pods. It is unclear what purpose the existing runner pod serves if a new pod is also being started. Details of the extra pod are in the screenshot below.

[Screenshot, 2024-07-04: an extra workflow pod running alongside the existing runner pods]
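
One way to confirm what the workflow pod actually received (a sketch; the pod name is a placeholder, and in Kubernetes mode the job pod gets a "-workflow" suffix):

# prints the resources block of the job container; empty output means no requests/limits were applied
kubectl get pod <runner-pod-name>-workflow -n arc-runners -o jsonpath='{.spec.containers[0].resources}'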

Additional Context

The values.yaml file for the gha-runner-scale-set is below.

## githubConfigUrl is the GitHub url for where you want to configure runners
## ex: https://github.com/myorg/myrepo or https://github.com/myorg
githubConfigUrl: "https://github.com/curefit"

## githubConfigSecret is the k8s secrets to use when auth with GitHub API.
## You can choose to use GitHub App or a PAT token
githubConfigSecret:
  ### GitHub Apps Configuration
  ## NOTE: IDs MUST be strings, use quotes
  #github_app_id: ""
  #github_app_installation_id: ""
  #github_app_private_key: |

  ### GitHub PAT Configuration
  github_token: ""
## If you have a pre-define Kubernetes secret in the same namespace the gha-runner-scale-set is going to deploy,
## you can also reference it via `githubConfigSecret: pre-defined-secret`.
## You need to make sure your predefined secret has all the required secret data set properly.
##   For a pre-defined secret using GitHub PAT, the secret needs to be created like this:
##   > kubectl create secret generic pre-defined-secret --namespace=my_namespace --from-literal=github_token='ghp_your_pat'
##   For a pre-defined secret using GitHub App, the secret needs to be created like this:
##   > kubectl create secret generic pre-defined-secret --namespace=my_namespace --from-literal=github_app_id=123456 --from-literal=github_app_installation_id=654321 --from-literal=github_app_private_key='-----BEGIN CERTIFICATE-----*******'
# githubConfigSecret: pre-defined-secret

## proxy can be used to define proxy settings that will be used by the
## controller, the listener and the runner of this scale set.
#
# proxy:
#   http:
#     url: http://proxy.com:1234
#     credentialSecretRef: proxy-auth # a secret with `username` and `password` keys
#   https:
#     url: http://proxy.com:1234
#     credentialSecretRef: proxy-auth # a secret with `username` and `password` keys
#   noProxy:
#     - example.com
#     - example.org

# maxRunners is the max number of runners the autoscaling runner set will scale up to.
# maxRunners: 5

# minRunners is the min number of idle runners. The target number of runners created will be
# calculated as a sum of minRunners and the number of jobs assigned to the scale set.
minRunners: 3

runnerGroup: "arc-runner-kubernetes-ci-arm-large"

# ## name of the runner scale set to create.  Defaults to the helm release name
runnerScaleSetName: "arc-runner-kubernetes-ci-arm-large"

## A self-signed CA certificate for communication with the GitHub server can be
## provided using a config map key selector. If `runnerMountPath` is set, for
## each runner pod ARC will:
## - create a `github-server-tls-cert` volume containing the certificate
##   specified in `certificateFrom`
## - mount that volume on path `runnerMountPath`/{certificate name}
## - set NODE_EXTRA_CA_CERTS environment variable to that same path
## - set RUNNER_UPDATE_CA_CERTS environment variable to "1" (as of version
##   2.303.0 this will instruct the runner to reload certificates on the host)
##
## If any of the above had already been set by the user in the runner pod
## template, ARC will observe those and not overwrite them.
## Example configuration:
#
# githubServerTLS:
#   certificateFrom:
#     configMapKeyRef:
#       name: config-map-name
#       key: ca.crt
#   runnerMountPath: /usr/local/share/ca-certificates/

## Container mode is an object that provides out-of-box configuration
## for dind and kubernetes mode. Template will be modified as documented under the
## template object.
##
## If any customization is required for dind or kubernetes mode, containerMode should remain
## empty, and configuration should be applied to the template.
containerMode:
  type: "kubernetes"  ## type can be set to dind or kubernetes
  ## the following is required when containerMode.type=kubernetes
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    # For local testing, use https://github.com/openebs/dynamic-localpv-provisioner/blob/develop/docs/quickstart.md to provide dynamic provision volume with storageClassName: openebs-hostpath
    storageClassName: "gp3"
    resources:
      requests:
        storage: 5Gi
#   kubernetesModeServiceAccount:
#     annotations:

## listenerTemplate is the PodSpec for each listener Pod
## For reference: https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#PodSpec
# listenerTemplate:
#   spec:
#     containers:
#     # Use this section to append additional configuration to the listener container.
#     # If you change the name of the container, the configuration will not be applied to the listener,
#     # and it will be treated as a side-car container.
#     - name: listener
#       securityContext:
#         runAsUser: 1000
#     # Use this section to add the configuration of a side-car container.
#     # Comment it out or remove it if you don't need it.
#     # Spec for this container will be applied as is without any modifications.
#     - name: side-car
#       image: example-sidecar

## template is the PodSpec for each runner Pod
## For reference: https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#PodSpec
template:
  ## template.spec will be modified if you change the container mode
  ## with containerMode.type=dind, we will populate the template.spec with following pod spec
  ## template:
  ##   spec:
  ##     initContainers:
  ##     - name: init-dind-externals
  ##       image: ghcr.io/actions/actions-runner:latest
  ##       command: ["cp", "-r", "-v", "/home/runner/externals/.", "/home/runner/tmpDir/"]
  ##       volumeMounts:
  ##         - name: dind-externals
  ##           mountPath: /home/runner/tmpDir
  ##     containers:
  ##     - name: runner
  ##       image: ghcr.io/actions/actions-runner:latest
  ##       command: ["/home/runner/run.sh"]
  ##       env:
  ##         - name: DOCKER_HOST
  ##           value: unix:///var/run/docker.sock
  ##       volumeMounts:
  ##         - name: work
  ##           mountPath: /home/runner/_work
  ##         - name: dind-sock
  ##           mountPath: /var/run
  ##     - name: dind
  ##       image: docker:dind
  ##       args:
  ##         - dockerd
  ##         - --host=unix:///var/run/docker.sock
  ##         - --group=$(DOCKER_GROUP_GID)
  ##       env:
  ##         - name: DOCKER_GROUP_GID
  ##           value: "123"
  ##       securityContext:
  ##         privileged: true
  ##       volumeMounts:
  ##         - name: work
  ##           mountPath: /home/runner/_work
  ##         - name: dind-sock
  ##           mountPath: /var/run
  ##         - name: dind-externals
  ##           mountPath: /home/runner/externals
  ##     volumes:
  ##     - name: work
  ##       emptyDir: {}
  ##     - name: dind-sock
  ##       emptyDir: {}
  ##     - name: dind-externals
  ##       emptyDir: {}
  ######################################################################################################
  ## with containerMode.type=kubernetes, we will populate the template.spec with following pod spec
  ## template:
  ##   spec:
  ##     containers:
  ##     - name: runner
  ##       image: ghcr.io/actions/actions-runner:latest
  ##       command: ["/home/runner/run.sh"]
  ##       resources:
  ##         requests:
  ##           memory: "4Gi"
  ##           cpu: "2"
  ##         limits:
  ##           memory: "6Gi"
  ##           cpu: "4"  
  ##       env:
  ##         - name: ACTIONS_RUNNER_CONTAINER_HOOKS
  ##           value: /home/runner/k8s/index.js
  ##         - name: ACTIONS_RUNNER_POD_NAME
  ##           valueFrom:
  ##             fieldRef:
  ##               fieldPath: metadata.name
  ##         - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
  ##           value: "true"
  ##       volumeMounts:
  ##         - name: work
  ##           mountPath: /home/runner/_work
  ##     volumes:
  ##       - name: work
  ##         ephemeral:
  ##           volumeClaimTemplate:
  ##             spec:
  ##               accessModes: [ "ReadWriteOnce" ]
  ##               storageClassName: "local-path"
  ##               resources:
  ##                 requests:
  ##                   storage: 1Gi
  spec:
    securityContext:
      fsGroup: 1001
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
        - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
          value: "false"  
    nodeSelector:
      purpose: github-actions-arm-large
    tolerations:
      - key: purpose
        operator: Equal
        value: github-actions-arm-large
        effect: NoSchedule       
## Optional controller service account that needs to have required Role and RoleBinding
## to operate this gha-runner-scale-set installation.
## The helm chart will try to find the controller deployment and its service account at installation time.
## In case the helm chart can't find the right service account, you can explicitly pass in the following value
## to help it finish RoleBinding with the right service account.
## Note: if your controller is installed to only watch a single namespace, you have to pass these values explicitly.
# controllerServiceAccount:
#   namespace: arc-system
#   name: test-arc-gha-runner-scale-set-controller

I have specifically mentioned the resources in the Kubernetes-mode section:
  ##       resources:
  ##         requests:
  ##           memory: "4Gi"
  ##           cpu: "2"
  ##         limits:
  ##           memory: "6Gi"
  ##           cpu: "4"

Controller Logs

https://gist.github.com/kanakaraju17/31a15aa0a1b5a04fb7eaab6996c02d40

[this is not related to the resource request constraint for the runner pods]

Runner Pod Logs

https://gist.github.com/kanakaraju17/c33c0012f80a48a1e4504bd241c278cc
jonathan-fileread commented 4 months ago

You need to define those in a pod template, after declaring the pod template YAML in the runner scale set's values.yaml (Terraform example below, by the way).

[Screenshot, 2024-07-05: Terraform snippet declaring the pod template in the runner scale set values]

kanakaraju17 commented 4 months ago

Hey @jonathan-fileread, is there a way to configure this in the default values.yaml file provided with the gha-runner-scale-set charts?

jonathan-fileread commented 4 months ago

@kanakaraju17 Hey Kanaka, unfortunately not. You need to create a separate pod template in order to define the workflow pod, as the values.yaml only defines the runner pod settings.

kanakaraju17 commented 4 months ago

@jonathan-fileread, any idea why the file is not getting mounted in the runner pods? I'm using the following configuration and encountering the error below:

## template is the PodSpec for each runner Pod
## For reference: https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#PodSpec
template:
  # with containerMode.type=kubernetes, we will populate the template.spec with following pod spec
  template:
    spec:  
      securityContext:
        fsGroup: 123      
      containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          - name: ACTIONS_RUNNER_CONTAINER_HOOKS
            value: /home/runner/pod-templates/default.yml
          - name: ACTIONS_RUNNER_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "false"      
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: pod-templates
            mountPath: /home/runner/pod-templates
            readOnly: true  
      volumes:
        - name: work
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes: [ "ReadWriteOnce" ]
                storageClassName: "gp3"
                resources:
                  requests:
                    storage: 1Gi
        - name: pod-templates
          configMap:
            name: runner-pod-template

ConfigMap Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: runner-pod-template
data:
  default.yml: |
    apiVersion: v1
    kind: PodTemplate
    metadata:
      name: runner-pod-template
    spec:
      containers:
      - name: "$job"
        resources:
          limits:
            cpu: "3000m"
          requests:
            cpu: "3000m"

The pods fail and end up with the below error:

Error: Error: ENOENT: no such file or directory, open '/home/runner/pod-templates/default.yml'
Error: Process completed with exit code 1.

Have you tried recreating it in your environment? Have you come across this error before? It seems to be a mounting issue where the file is not found.
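
One way to narrow this down is to check whether the ConfigMap volume actually landed on the runner pod before the hook tries to read it (a sketch; pod name and namespace are placeholders):

# is the pod-templates volume and mount present in the runner pod spec?
kubectl get pod <runner-pod-name> -n arc-runners -o yaml | grep -B2 -A4 'pod-templates'

# is the file visible from inside the runner container?
kubectl exec <runner-pod-name> -n arc-runners -c runner -- ls -l /home/runner/pod-templates/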

georgblumenschein commented 4 months ago

@kanakaraju17 You can follow the official guide which worked for me at least :)

https://docs.github.com/en/enterprise-server@3.10/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/deploying-runner-scale-sets-with-actions-runner-controller#understanding-runner-container-hooks

In your case that would be something like:

ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: hook-extension
data:
  content: |
    spec:
      containers:
        - name: "$job"
          resources:
            limits:
              cpu: "3000m"
            requests:
              cpu: "3000m"

Usage:

template:
    spec:
      containers:
      - name: runner
        ...
        env:
          ...
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/pod-template/content
        volumeMounts:
          ...
          - name: pod-template
            mountPath: /home/runner/pod-template
            readOnly: true  
      volumes:
        ...
        - name: pod-template
          configMap:
            name: hook-extension
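
For context, the "$job" container in the hook template is merged into the job container that the Kubernetes hook creates for each workflow job, so the job still has to run in a container for the template to have anything to apply to. A hedged example workflow (runner label taken from this thread, image purely illustrative):

name: example
on: push
jobs:
  build:
    runs-on: arc-runner-kubernetes-ci-arm-large   # runnerScaleSetName from this thread
    container:
      image: node:20                              # illustrative job container image
    steps:
      - run: echo "this step runs in the '-workflow' pod, which should pick up the hook template resources"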
kanakaraju17 commented 4 months ago

Hey @georgblumenschein, deploying the gha-runner-scale-set with the environment variables below added doesn't seem to take effect.

template:
  template:
    spec:
      containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          - name: ACTIONS_RUNNER_CONTAINER_HOOKS
            value: /home/runner/k8s/index.js
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/pod-template/content
          - name: ACTIONS_RUNNER_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "true" 

Additional ENV Variable Added:

          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/pod-template/content

The workflow pods should include the ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE environment variable and the volume mount, but neither appears when describing the pods; the output is missing this variable.

Expected result: the ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE environment variable and the volume mounts should be present in the workflow pods.

Below is the values.yaml template used to add the environment variable:

template:
  template:
    spec:
      containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          - name: ACTIONS_RUNNER_CONTAINER_HOOKS
            value: /home/runner/k8s/index.js
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/pod-template/content
          - name: ACTIONS_RUNNER_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "true"
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: pod-template
            mountPath: /home/runner/pod-template
            readOnly: true  
      volumes:
        - name: work
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes: [ "ReadWriteOnce" ]
                storageClassName: "local-path"
                resources:
                  requests:
                    storage: 1Gi
        - name: pod-template
          configMap:
            name: hook-extension          

Problem: the pods should have the ConfigMap volume mounted and the specified environment variables added, but this is not happening.

Current Output:

Describing the AutoscalingRunnerSet does not show the added env variables either.

Name:         arc-runner-kubernetes-ci-arm-large
Namespace:    arc-runners-kubernetes-arm
Labels:       actions.github.com/organization=curefit
              actions.github.com/scale-set-name=arc-runner-kubernetes-ci-arm-large
              actions.github.com/scale-set-namespace=arc-runners-kubernetes-arm
              app.kubernetes.io/component=autoscaling-runner-set
              app.kubernetes.io/instance=arc-runner-kubernetes-ci-arm-large
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=arc-runner-kubernetes-ci-arm-large
              app.kubernetes.io/part-of=gha-rs
              app.kubernetes.io/version=0.9.3
              helm.sh/chart=gha-rs-0.9.3
Annotations:  actions.github.com/cleanup-kubernetes-mode-role-binding-name: arc-runner-kubernetes-ci-arm-large-gha-rs-kube-mode
              actions.github.com/cleanup-kubernetes-mode-role-name: arc-runner-kubernetes-ci-arm-large-gha-rs-kube-mode
              actions.github.com/cleanup-kubernetes-mode-service-account-name: arc-runner-kubernetes-ci-arm-large-gha-rs-kube-mode
              actions.github.com/cleanup-manager-role-binding: arc-runner-kubernetes-ci-arm-large-gha-rs-manager
              actions.github.com/cleanup-manager-role-name: arc-runner-kubernetes-ci-arm-large-gha-rs-manager
              actions.github.com/runner-group-name: arc-runner-kubernetes-ci-arm-large
              actions.github.com/runner-scale-set-name: arc-runner-kubernetes-ci-arm-large
              actions.github.com/values-hash: 8b5caae634d958cc7d295b3166c151d036c7896d2b6165bf908a6a24aec5320
              meta.helm.sh/release-name: arc-runner-set-kubernetes-arm-large
              meta.helm.sh/release-namespace: arc-runners-kubernetes-arm
              runner-scale-set-id: 76
API Version:  actions.github.com/v1alpha1
Kind:         AutoscalingRunnerSet
Metadata:
  Creation Timestamp:  2024-07-16T09:49:56Z
  Finalizers:
    autoscalingrunnerset.actions.github.com/finalizer
  Generation:        1
  Resource Version:  577760766
  UID:               165f74f7-875c-4b8f-a214-96948ec38467
Spec:
  Github Config Secret:  github-token
  Github Config URL:     https://github.com/curefit
  Listener Template:
    Spec:
      Containers:
        Name:  listener
        Resources:
          Limits:
            Cpu:     500m
            Memory:  500Mi
          Requests:
            Cpu:     250m
            Memory:  250Mi
      Node Selector:
        Purpose:  github-actions
      Tolerations:
        Effect:           NoSchedule
        Key:              purpose
        Operator:         Equal
        Value:            github-actions
  Min Runners:            2
  Runner Group:           arc-runner-kubernetes-ci-arm-large
  Runner Scale Set Name:  arc-runner-kubernetes-ci-arm-large
  Template:
    Spec:
      Containers:
        Command:
          /home/runner/run.sh
        Env:
          Name:   ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
          Value:  false
          Name:   ACTIONS_RUNNER_CONTAINER_HOOKS
          Value:  /home/runner/k8s/index.js
          Name:   ACTIONS_RUNNER_POD_NAME
          Value From:
            Field Ref:
              Field Path:  metadata.name
        Image:             ghcr.io/actions/actions-runner:latest
        Name:              runner
        Volume Mounts:
          Mount Path:  /home/runner/_work
          Name:        work
      Node Selector:
        Purpose:       github-actions
      Restart Policy:  Never
      Security Context:
        Fs Group:            1001
      Service Account Name:  arc-runner-kubernetes-ci-arm-large-gha-rs-kube-mode
      Tolerations:
        Effect:    NoSchedule
        Key:       purpose
        Operator:  Equal
        Value:     github-actions
      Volumes:
        Ephemeral:
          Volume Claim Template:
            Spec:
              Access Modes:
                ReadWriteOnce
              Resources:
                Requests:
                  Storage:         5Gi
              Storage Class Name:  gp3
        Name:                      work
Status:
  Current Runners:            2
  Pending Ephemeral Runners:  2
Events:                       <none>
Below is the ConfigMap that is being used:

apiVersion: v1
kind: ConfigMap
metadata:
  name: hook-extension
  namespace: arc-runners-kubernetes-arm
data:
  content: |
    spec:
      containers:
        - name: "$job"
          resources:
            limits:
              cpu: "3000m"
            requests:
              cpu: "3000m"

Expected behavior: the ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE environment variable and the volume mounts should be added to the pods that come up.
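
A quick way to check whether the chart even renders the extra env var and volume into the AutoscalingRunnerSet, before looking at the pods (a sketch; it assumes the chart is installed from the local directory as in the earlier commands):

helm template arc-runner-set-kubernetes-arm-large . -f values-kubernetes.yaml -n arc-runners-kubernetes-arm \
  | grep -n -A1 'ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE'

# no output here suggests the values nesting isn't being picked up by the chart (note that the chart's
# default values.yaml quoted earlier in this issue uses template.spec.containers, not template.template.spec.containers)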

marcomarques-bt commented 3 months ago

Hey @kanakaraju17 ,

After 2 days of trial and error I managed to get a working scenario with resource limits applied. Funny thing is, we were overcomplicating it with the "hook-extensions". All we need to do is add it in the template.spec.containers[0].resources.requests/limits section.

Below is a snippet of the values to pass into Helm (although I am using a HelmRelease with FluxCD, the principle still applies):

  values:
    containerMode:
      type: "kubernetes"
      kubernetesModeWorkVolumeClaim:
        accessModes: ["ReadWriteOnce"]
        storageClassName: "standard"
        resources:
          requests:
            storage: 10Gi
    githubConfigSecret: gh-secret
    githubConfigUrl : "https://github.com/<Organisation>"
    runnerGroup: "k8s-nonprod"
    runnerScaleSetName: "self-hosted-k8s" # used as a runner label
    minRunners: 1
    maxRunners: 10
    template:
      spec:
        securityContext:
          fsGroup: 1001
        imagePullSecrets:
          - name: cr-secret
        containers:
          - name: runner
            image: ghcr.io/actions/actions-runner:latest
            command: ["/home/runner/run.sh"]
            resources:
              limits:
                cpu: "2000m"
                memory: "5Gi"
              requests:
                cpu: "200m"
                memory: "512Mi"

I have confirmed that this has been working for me; I had some CodeQL workflows failing due to "insufficient RAM", lol.

Hope it helps.

kanakaraju17 commented 2 months ago

@marcomarques-bt, I assume the above configuration only applies to the runner pods, not to the pods where the workflow actually runs (the workflow pods).

Refer to the image below: the configuration applies to the first pod, but not to the second pod where the actual job runs.

[Screenshot, 2024-08-30: resources applied to the runner pod but not to the workflow pod]
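
One way to see the gap directly (a sketch; pod names are placeholders, and the job pod carries the "-workflow" suffix):

# runner pod: picks up template.spec.containers[0].resources from values.yaml
kubectl get pod <runner-pod-name> -n arc-runners -o jsonpath='{.spec.containers[0].resources}'

# workflow pod: only picks up resources from the container hook template (ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE)
kubectl get pod <runner-pod-name>-workflow -n arc-runners -o jsonpath='{.spec.containers[0].resources}'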
pyama86 commented 1 month ago

It seems that, similar to the issue mentioned earlier, tolerations cannot be configured for the workflow pods either.