actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

Runner pods randomly receiving sigterm, entire workflow being (sometimes) cancelled #3650

Closed — whibbard-genies closed this issue 4 months ago

whibbard-genies commented 4 months ago


Controller Version

0.26.0

Helm Chart Version

0.21.1

CertManager Version

1.10.1

Deployment Method

Helm

cert-manager installation

helm upgrade --install --create-namespace -n cert-manager cert-manager . -f values.yaml --set nodeSelector.Name=admin_node --set installCRDs=true
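The command used to install the controller itself is not shown in the report; for context, a typical Helm install matching the chart version listed above might look roughly like the following (the chart repository URL, release name, and namespace are assumptions, not taken from the report):

helm repo add actions-runner-controller https://actions-runner-controller.github.io/actions-runner-controller
helm upgrade --install --namespace actions-runner-system --create-namespace \
  actions-runner-controller actions-runner-controller/actions-runner-controller --version 0.21.1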


Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: genies-internal-content-runner-small-2021.3.31f1
  namespace: actions-runner-system
  labels:
    app: genies-internal-content-runner-small-2021.3.31f1
    env: unity
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      organization: geniesinc
      labels:
        - genies-internal-content-runner-small-2021.3.31f1
        - genies-internal-content-runner-small
      nodeSelector:
        Name: github_runners_content
      group: genies-content-runners-small-2021.3.31f1
      image: summerwind/actions-runner-dind:v2.311.0-ubuntu-20.04-fb11d3b
      imagePullPolicy: IfNotPresent
      dockerdWithinRunnerContainer: true
      ephemeral: true
      resources:
        requests:
          memory: "12Gi"
        limits:
          memory: "12Gi"
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: runner-content-small-autoscaler
  namespace: actions-runner-system
  labels:
    app: genies-internal-content-runner-small-2021.3.31f1
    env: unity
spec:
  scaleDownDelaySecondsAfterScaleOut: 120
  scaleTargetRef:
    kind: RunnerDeployment
    name: genies-internal-content-runner-small-2021.3.31f1
  minReplicas: 0
  maxReplicas: 100
  metrics:
    - type: PercentageRunnersBusy
      scaleUpThreshold: "0.75"  # Scale up when 75% of the runners are busy
      scaleDownThreshold: "0.25"  # Scale down when fewer than 25% of the runners are busy
      scaleUpAdjustment: 2
      scaleDownAdjustment: 2
    - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
      repositoryNames:
      # A repository name is the REPO part of `github.com/OWNER/REPO`
      # All repos that want to use these runners have to be added here.
      - Dynamic-Content-Addressable-Builder
      - Content-Generator-V2
      - dev-kit-unity-sdk
      - GeniesParty
      - Content-Processor
      - genies-gap-wearables
      - Cloud-Processor
      - UnityRendering
      - gha-sandbox
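Not part of the report above, but relevant to the SIGTERM symptom: the legacy (summerwind) controller documents a graceful-termination setup in which terminationGracePeriodSeconds on the runner pod is paired with the runner image's RUNNER_GRACEFUL_STOP_TIMEOUT environment variable, so that a runner receiving SIGTERM has time to finish or cleanly fail its job. A minimal sketch of what that could look like in the RunnerDeployment above (values are illustrative, not from the original manifest):

spec:
  template:
    spec:
      # Give the pod time to wind down before the kubelet force-kills it.
      terminationGracePeriodSeconds: 120
      env:
        # Should be somewhat shorter than terminationGracePeriodSeconds.
        - name: RUNNER_GRACEFUL_STOP_TIMEOUT
          value: "90"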

To Reproduce

Cannot reproduce on demand; the failures occur at random.

Describe the bug

Workflow jobs running on self-hosted runners randomly receive SIGTERM and get wiped out with no explanation. Our issue appears to be exactly the one described in these existing issues, so it is not new: https://github.com/actions/actions-runner-controller/issues/2695 https://github.com/actions/actions-runner-controller/discussions/2417

Describe the expected behavior

Workflow jobs are not randomly cancelled, and runner pods do not receive SIGTERM without explanation.

Whole Controller Logs

Full log for the relevant timeframe: https://gist.github.com/whibbard-genies/bfdb3deda606cbdfdccd4210064fe6be

Filtered log containing only the lines that mention the runner pod name: https://gist.github.com/whibbard-genies/e34443349b5a0630e725001dbda9816c

Whole Runner Pod Logs

https://gist.github.com/whibbard-genies/d4cdeeb3cca297b538df48cba8d2a04f

Additional Context

While this post highlights the issue of the SIGTERM being sent, it also points to a need for a new feature in GitHub Actions that I hope will eventually be added.

whibbard-genies commented 4 months ago

Upon further inspection of the node that was hosting this runner pod, I am finding that it may have been infrastructure-related after all. Sadly, I have no clue as to why it happened. The node went into NotReady status at 8am local time, and the runner pod was terminated about 15 minutes later. The node then stopped reporting its status to Datadog for a whopping 2 hours (presumably the Datadog agent pod got wiped out the same way the runners did, but I wasn't capturing logs for that one) before it went back into Ready status. That's pretty incriminating of the infrastructure here, despite memory, ephemeral storage, and CPU usage all being within safe ranges on the node.

[image attachment]

What is odd on the GitHub side is that the entire workflow run was cancelled, rather than just the individual job that was being handled by the runner at that time failing. The subsequent jobs, which were intended to run on a different runner regardless of the success or failure of the previous jobs, should have executed, but they did not. That is extremely odd behavior on GitHub Actions' end and something I should likely report elsewhere.
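For reference (not from the original comment): the "run regardless of the previous job's outcome" behavior is normally expressed with an if: always() condition on the dependent job, and a run-level cancellation is handled differently from a single failed job, which may be why the dependent job never started here. A minimal workflow sketch with hypothetical job names and steps:

jobs:
  build:
    runs-on: [self-hosted, genies-internal-content-runner-small]
    steps:
      - run: ./build.sh        # hypothetical build step
  report:
    needs: build
    if: always()               # intended to run whether build succeeds or fails
    runs-on: [self-hosted, genies-internal-content-runner-small]
    steps:
      - run: ./report.sh       # hypothetical reporting step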

Unfortunately I don't know what else to check for my issue here, but perhaps it is not actually a GitHub Actions problem at all. I will close this out and leave this post up for others to find if they are suffering from the same problem.
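Not part of the original thread, but since the last comment mentions not knowing what else to check: the pod and node events usually show whether the kubelet, the cluster autoscaler, or the controller initiated the termination. A minimal sketch, with placeholder pod and node names:

# Events recorded against the runner pod (evictions, kills, rescheduling)
kubectl -n actions-runner-system get events --sort-by=.lastTimestamp | grep <runner-pod-name>

# Node conditions and recent events (NotReady transitions, memory/disk pressure)
kubectl describe node <node-name>

# Just the kubelet-reported conditions
kubectl get node <node-name> -o jsonpath='{.status.conditions}'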