Runner pods randomly receiving sigterm, entire workflow being (sometimes) cancelled

Checks

[X] I've already read https://github.com/actions/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I'm sure my issue is not covered in the troubleshooting guide.
[X] I'm not using a custom entrypoint in my runner image

Controller Version

0.26.0

Helm Chart Version

0.21.1

CertManager Version

1.10.1

Deployment Method

Helm

cert-manager installation

helm upgrade --install --create-namespace -n cert-manager cert-manager . -f values.yaml --set nodeSelector.Name=admin_node --set installCRDs=true

Checks

[X] This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of contributors and maintainers if your business is so critical and therefore you need priority support
[X] I've read releasenotes before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
[X] My actions-runner-controller version (v0.x.y) does support the feature
[X] I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
[X] I've migrated to the workflow job webhook event (if you using webhook driven scaling)

Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: genies-internal-content-runner-small-2021.3.31f1
  namespace: actions-runner-system
  labels:
    app: genies-internal-content-runner-small-2021.3.31f1
    env: unity
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      organization: geniesinc
      labels:
        - genies-internal-content-runner-small-2021.3.31f1
        - genies-internal-content-runner-small
      nodeSelector:
        Name: github_runners_content
      group: genies-content-runners-small-2021.3.31f1
      image: summerwind/actions-runner-dind:v2.311.0-ubuntu-20.04-fb11d3b
      imagePullPolicy: IfNotPresent
      dockerdWithinRunnerContainer: true
      ephemeral: true
      resources:
        requests:
          memory: "12Gi"
        limits:
          memory: "12Gi"
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: runner-content-small-autoscaler
  namespace: actions-runner-system
  labels:
    app: genies-internal-content-runner-small-2021.3.31f1
    env: unity
spec:
  scaleDownDelaySecondsAfterScaleOut: 120
  scaleTargetRef:
    kind: RunnerDeployment
    name: genies-internal-content-runner-small-2021.3.31f1
  minReplicas: 0
  maxReplicas: 100
  metrics:
    - type: PercentageRunnersBusy
      scaleUpThreshold: "0.75"  # Updated to scale up when 50% of the runners are busy
      scaleDownThreshold: "0.25"  # You can adjust this value as per your requirements
      scaleUpAdjustment: 2
      scaleDownAdjustment: 2
    - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
      repositoryNames:
      # A repository name is the REPO part of `github.com/OWNER/REPO`
      # All repos that want to use these runners have to be added here.
      - Dynamic-Content-Addressable-Builder
      - Content-Generator-V2
      - dev-kit-unity-sdk
      - GeniesParty
      - Content-Processor
      - genies-gap-wearables
      - Cloud-Processor
      - UnityRendering
      - gha-sandbox

To Reproduce

Cannot reproduce, failures are random

Describe the bug

Workflow jobs, running in self-hosted runners, randomly receive sigterm and get wiped out with no explanation. Our issue seems to be the exact one described in these issues and is not new: https://github.com/actions/actions-runner-controller/issues/2695 https://github.com/actions/actions-runner-controller/discussions/2417

Describe the expected behavior

Workflow jobs not being randomly cancelled, runner pods not receiving sigterm with no explanation

Whole Controller Logs

Full log for the relevant timeframe: https://gist.github.com/whibbard-genies/bfdb3deda606cbdfdccd4210064fe6be

Filtered only containing lines containing the runner pod name: https://gist.github.com/whibbard-genies/e34443349b5a0630e725001dbda9816c

Whole Runner Pod Logs

https://gist.github.com/whibbard-genies/d4cdeeb3cca297b538df48cba8d2a04f

Additional Context

Ruled out OOM issue, it's not that
On-demand nodes are being used, not spot instances
There are NO events that I can see that are showing any issues with running out of storage, as suggested in one of the linked issues
I see nothing on my side that at all points me to what this could be. I am now being forced to write an in-house service that will have to interact with the Github API to detect and automatically retry these failed jobs, as we cannot have this level of unpredictable unreliability. Github Actions really should have a feature for automatically rerunning failed jobs to at least help us handle situations like this.

So while this post is highlighting the issue of the sigterm being sent, it is highlighting a need for a new feature in GHA that I hope will be added eventually.

actions / actions-runner-controller

Runner pods randomly receiving sigterm, entire workflow being (sometimes) cancelled #3650

Checks

Controller Version

Helm Chart Version

CertManager Version

Deployment Method

cert-manager installation

Checks

Resource Definitions

To Reproduce

Describe the bug

Describe the expected behavior

Whole Controller Logs

Whole Runner Pod Logs

Additional Context