actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

Runner pod count always more than desired / max count defined in HRA #3203

Open prabhu-mns opened 9 months ago

prabhu-mns commented 9 months ago

Checks

Controller Version

0.26.0

Helm Chart Version

0.21.1

CertManager Version

No response

Deployment Method

Helm

cert-manager installation

Yes

Resource Definitions

HRA:
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"actions.summerwind.dev/v1alpha1","kind":"HorizontalRunnerAutoscaler","metadata":{"annotations":{},"name":"mac-horizontal-runner-autoscaler","namespace":"github-selfhosted-runner-autoscaler"},"spec":{"maxReplicas":100,"minReplicas":0,"scaleTargetRef":{"kind":"RunnerDeployment","name":"mac-runner-deployment"},"scaleUpTriggers":[{"duration":"30m","githubEvent":{"workflowJob":{}}}]}}
  creationTimestamp: "2022-10-03T09:55:07Z"
  generation: 416944
  name: mac-horizontal-runner-autoscaler
  namespace: github-selfhosted-runner-autoscaler
  resourceVersion: "772046368"
  uid: b758c1c1-42aa-416a-81c3-f23c5b1074b3
spec:
  maxReplicas: 20
  minReplicas: 10
  scaleDownDelaySecondsAfterScaleOut: 90
  scaleTargetRef:
    kind: RunnerDeployment
    name: mac-runner-deployment
  scaleUpTriggers:
  - duration: 30m0s
    githubEvent:
      workflowJob: {}
status:
  desiredReplicas: 10
  lastSuccessfulScaleOutTime: "2023-11-30T17:36:35Z"

Runner deployment:
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"actions.summerwind.dev/v1alpha1","kind":"RunnerDeployment","metadata":{"annotations":{},"creationTimestamp":"2022-10-03T09:55:07Z","generation":363290,"name":"mac-runner-deployment","namespace":"github-selfhosted-runner-autoscaler","resourceVersion":"696673489","uid":"c594a2c7-103d-48f4-b44c-0e289fc4b127"},"spec":{"effectiveTime":"2023-10-11T04:08:04Z","replicas":16,"template":{"spec":{"containers":[{"name":"runner","securityContext":{"privileged":true}}],"env":[{"name":"DISABLE_RUNNER_UPDATE","value":"true"}],"group":"github-selfhosted-runner-autoscaler-mac","image":"actions-runner:2.310.2-fix","imagePullPolicy":"IfNotPresent","imagePullSecrets":[{"name":"prod-acr-secret"}],"labels":["github-selfhosted-runner-autoscaler-mac","ep-selfhosted"],"nodeSelector":{"agent":"self-hosted125"},"organization":"ORGNAME"}}},"status":{"availableReplicas":16,"desiredReplicas":16,"readyReplicas":16,"replicas":16,"updatedReplicas":16}}
  creationTimestamp: "2022-10-03T09:55:07Z"
  generation: 377189
  name: mac-runner-deployment
  namespace: github-selfhosted-runner-autoscaler
  resourceVersion: "772126354"
  uid: c594a2c7-103d-48f4-b44c-0e289fc4b127
spec:
  effectiveTime: "2023-10-27T23:58:41Z"
  replicas: 10
  template:
    spec:
      containers:
      - name: runner
        securityContext:
          privileged: true
      env:
      - name: DISABLE_RUNNER_UPDATE
        value: "true"
      group: github-selfhosted-runner-autoscaler-mac
      image: actions-runner:2.311.0-fix
      imagePullPolicy: IfNotPresent
      imagePullSecrets:
      - name: prod-acr-secret
      labels:
      - github-selfhosted-runner-autoscaler-mac
      - ep-selfhosted
      nodeSelector:
        agent: self-hosted125
      organization: 
status:
  availableReplicas: 0
  desiredReplicas: 10
  readyReplicas: 0
  replicas: 361
  updatedReplicas: 361

To Reproduce

1. Deploy the RunnerDeployment and HorizontalRunnerAutoscaler.
2. Run GitHub Actions workflows with many parallel jobs.
3. Observe that the pod count goes beyond the desired / max value defined in the HRA.

Describe the bug

Pod count goes beyond the desired / max value defined in the HRA. Because of this, each runner deployment ends up with roughly 100+ pods, which triggers scale-up of additional nodes simply because nodes hit their max-pods limit, even though enough CPU and memory are still available. Once the maximum node count is reached, new pods cannot be scheduled and remain in Pending state, which impacts new workflow runs.
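For reference, the invariant the HRA is expected to enforce is a simple clamp of the computed replica count to the `[minReplicas, maxReplicas]` range. This is a minimal sketch of that expectation (not ARC's actual controller code):

```python
def clamp_replicas(suggested: int, min_replicas: int, max_replicas: int) -> int:
    """Clamp a suggested replica count to the HRA's [min, max] range,
    as the HorizontalRunnerAutoscaler is expected to do."""
    return max(min_replicas, min(suggested, max_replicas))

# With the HRA above (minReplicas: 10, maxReplicas: 20), a burst of
# parallel workflow_job scale-up triggers should never push the
# RunnerDeployment past 20 replicas, yet the status shows 361.
print(clamp_replicas(361, 10, 20))  # → 20
```

The observed `replicas: 361` in the RunnerDeployment status violates this bound, which is why node autoscaling runs away.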

Describe the expected behavior

Pod count stays within the desired / max values defined in the HRA.

Whole Controller Logs

https://gist.github.com/prabhu-mns/45e732ddec3ce157d54a50ea4bab65c7

Whole Runner Pod Logs

Runner docker container logs:
https://gist.github.com/prabhu-mns/eaacce96706495e2a9a8321ec6018f3a

Runner pod logs:
https://gist.github.com/prabhu-mns/0b78c5fb138654c7bf040e6fdb77a085

Additional Context

No response

github-actions[bot] commented 9 months ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

prabhu-mns commented 9 months ago

Hi, any feedback on this would be appreciated, as this is impacting our production workloads.