actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

Runner pod count always more than desired / max count defined in HRA #3203

Open prabhu-mns opened 9 months ago

prabhu-mns commented 9 months ago

Checks

Controller Version

0.26.0

Helm Chart Version

0.21.1

CertManager Version

No response

Deployment Method

Helm

cert-manager installation

Yes

Resource Definitions

HRA:
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"actions.summerwind.dev/v1alpha1","kind":"HorizontalRunnerAutoscaler","metadata":{"annotations":{},"name":"mac-horizontal-runner-autoscaler","namespace":"github-selfhosted-runner-autoscaler"},"spec":{"maxReplicas":100,"minReplicas":0,"scaleTargetRef":{"kind":"RunnerDeployment","name":"mac-runner-deployment"},"scaleUpTriggers":[{"duration":"30m","githubEvent":{"workflowJob":{}}}]}}
  creationTimestamp: "2022-10-03T09:55:07Z"
  generation: 416944
  name: mac-horizontal-runner-autoscaler
  namespace: github-selfhosted-runner-autoscaler
  resourceVersion: "772046368"
  uid: b758c1c1-42aa-416a-81c3-f23c5b1074b3
spec:
  maxReplicas: 20
  minReplicas: 10
  scaleDownDelaySecondsAfterScaleOut: 90
  scaleTargetRef:
    kind: RunnerDeployment
    name: mac-runner-deployment
  scaleUpTriggers:
  - duration: 30m0s
    githubEvent:
      workflowJob: {}
status:
  desiredReplicas: 10
  lastSuccessfulScaleOutTime: "2023-11-30T17:36:35Z"

Runner deployment:
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"actions.summerwind.dev/v1alpha1","kind":"RunnerDeployment","metadata":{"annotations":{},"creationTimestamp":"2022-10-03T09:55:07Z","generation":363290,"name":"mac-runner-deployment","namespace":"github-selfhosted-runner-autoscaler","resourceVersion":"696673489","uid":"c594a2c7-103d-48f4-b44c-0e289fc4b127"},"spec":{"effectiveTime":"2023-10-11T04:08:04Z","replicas":16,"template":{"spec":{"containers":[{"name":"runner","securityContext":{"privileged":true}}],"env":[{"name":"DISABLE_RUNNER_UPDATE","value":"true"}],"group":"github-selfhosted-runner-autoscaler-mac","image":"actions-runner:2.310.2-fix","imagePullPolicy":"IfNotPresent","imagePullSecrets":[{"name":"prod-acr-secret"}],"labels":["github-selfhosted-runner-autoscaler-mac","ep-selfhosted"],"nodeSelector":{"agent":"self-hosted125"},"organization":"ORGNAME"}}},"status":{"availableReplicas":16,"desiredReplicas":16,"readyReplicas":16,"replicas":16,"updatedReplicas":16}}
  creationTimestamp: "2022-10-03T09:55:07Z"
  generation: 377189
  name: mac-runner-deployment
  namespace: github-selfhosted-runner-autoscaler
  resourceVersion: "772126354"
  uid: c594a2c7-103d-48f4-b44c-0e289fc4b127
spec:
  effectiveTime: "2023-10-27T23:58:41Z"
  replicas: 10
  template:
    spec:
      containers:
      - name: runner
        securityContext:
          privileged: true
      env:
      - name: DISABLE_RUNNER_UPDATE
        value: "true"
      group: github-selfhosted-runner-autoscaler-mac
      image: actions-runner:2.311.0-fix
      imagePullPolicy: IfNotPresent
      imagePullSecrets:
      - name: prod-acr-secret
      labels:
      - github-selfhosted-runner-autoscaler-mac
      - ep-selfhosted
      nodeSelector:
        agent: self-hosted125
      organization: 
status:
  availableReplicas: 0
  desiredReplicas: 10
  readyReplicas: 0
  replicas: 361
  updatedReplicas: 361

To Reproduce

1. Deploy the RunnerDeployment and HorizontalRunnerAutoscaler.
2. Run GitHub Actions workflows with many parallel jobs.
3. Observe that the pod count goes beyond the desired / max value defined in the HRA.

Describe the bug

Pod count goes beyond the desired / max value defined in the HRA. Because of this, each runner deployment ends up with roughly 100+ pods, which triggers scale-up of additional nodes simply because nodes hit their max-pods limit, even though enough CPU and memory are still available. Once the maximum node count is reached, new pods cannot be scheduled and remain in Pending state, which impacts new workflow runs.
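For reference, the invariant the HRA is expected to enforce is a simple clamp of the computed replica count to the `[minReplicas, maxReplicas]` range. This is a minimal sketch of that expectation (not ARC's actual controller code):

```python
def clamp_replicas(suggested: int, min_replicas: int, max_replicas: int) -> int:
    """Clamp a suggested replica count to the HRA's [min, max] range,
    as the HorizontalRunnerAutoscaler is expected to do."""
    return max(min_replicas, min(suggested, max_replicas))

# With the HRA above (minReplicas: 10, maxReplicas: 20), a burst of
# parallel workflow_job scale-up triggers should never push the
# RunnerDeployment past 20 replicas, yet the status shows 361.
print(clamp_replicas(361, 10, 20))  # → 20
```

The observed `replicas: 361` in the RunnerDeployment status violates this bound, which is why node autoscaling runs away.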

Describe the expected behavior

Pod count stays within the desired / max values defined in the HRA.

Whole Controller Logs

https://gist.github.com/prabhu-mns/45e732ddec3ce157d54a50ea4bab65c7

Whole Runner Pod Logs

Runner docker container logs:
https://gist.github.com/prabhu-mns/eaacce96706495e2a9a8321ec6018f3a

Runner pod logs:
https://gist.github.com/prabhu-mns/0b78c5fb138654c7bf040e6fdb77a085

Additional Context

No response

github-actions[bot] commented 9 months ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

prabhu-mns commented 9 months ago

Hi, any feedback on this would be appreciated, as this is impacting our production workloads.