whibbard-genies closed this issue 4 months ago
Upon some further inspection of the node that was hosting this runner pod, I am finding that it may have been infra-related after all. Sadly, I just have no clue as to why it happened. The node went into NotReady status at 8am local time, and the runner pod was terminated about 15 minutes later. The node then stopped reporting its status to Datadog entirely (presumably the Datadog agent pod got wiped out the same as the runners, but I wasn't capturing logs for that one) for a whopping 2 hours before going back into Ready status. That's pretty incriminating of the infrastructure here, despite memory, ephemeral storage, and CPU usage all being within safe ranges on the node.
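For anyone hitting the same thing, this is roughly the set of checks I would run to correlate the NotReady window with the pod termination (the node, namespace, and pod names below are placeholders, not our real ones). One mechanism worth ruling out is taint-based eviction: once a node goes NotReady it gets the node.kubernetes.io/not-ready:NoExecute taint, and pods that only carry the default tolerationSeconds are evicted once that expires, which could also explain the Datadog agent pod disappearing.

```sh
# Placeholders: substitute your own node, namespace, and runner pod names.
NODE=ip-10-0-1-23.ec2.internal
NS=actions-runner-system

# When did the node flip between Ready and NotReady?
kubectl get node "$NODE" -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.lastTransitionTime}{"\n"}{end}'

# Node-level events: NodeNotReady, taints being applied, kubelet restarts, etc.
kubectl get events -A --field-selector involvedObject.kind=Node,involvedObject.name="$NODE" --sort-by=.lastTimestamp

# Pod-level events in the runner namespace: look for TaintManagerEviction or Killing.
kubectl get events -n "$NS" --sort-by=.lastTimestamp | grep -Ei 'evict|taint|kill'

# How long does the runner pod tolerate a NotReady node before being evicted?
kubectl get pod my-runner-pod -n "$NS" -o jsonpath='{.spec.tolerations}'
```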
What is odd on the GitHub side is that the entire workflow run was just cancelled, rather than the individual job that was being handled by the runner at the time failing. The subsequent jobs, which were intended to run on a different runner regardless of the success or failure of the previous jobs, should have executed, but they did not. That is extremely odd behavior on GitHub Actions' end and it's something I should likely report elsewhere.
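For anyone comparing notes, a downstream job that is meant to run regardless of the previous job's outcome typically looks something like the sketch below (the job and step names are just illustrative, not our actual workflow). One nuance worth double-checking before reporting this: as I read the GitHub docs, a job gated with if: success() || failure() will not run when the run is cancelled, whereas if: always() is documented to evaluate to true even on cancellation, so the exact expression used changes whether the skipping is expected.

```yaml
# Illustrative sketch only; names and labels are placeholders.
name: example
on: push

jobs:
  build:
    runs-on: [self-hosted, linux]
    steps:
      - run: make build

  report:
    # Intended to run whether or not build succeeds.
    #   if: success() || failure()  -> skipped if the run is cancelled
    #   if: always()                -> documented to run even when cancelled
    needs: build
    if: always()
    runs-on: [self-hosted, linux]
    steps:
      - run: ./scripts/report.sh
```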
Unfortunately I don't know what else to check for my issue here, but perhaps it is not actually a GitHub Actions problem at all. I will close this out and leave this post up for others to find if they are suffering from the same problem.
Checks
Controller Version
0.26.0
Helm Chart Version
0.21.1
CertManager Version
1.10.1
Deployment Method
Helm
cert-manager installation
helm upgrade --install --create-namespace -n cert-manager cert-manager . -f values.yaml --set nodeSelector.Name=admin_node --set installCRDs=true
Checks
Resource Definitions
To Reproduce
Describe the bug
Workflow jobs, running on self-hosted runners, randomly receive SIGTERM and get wiped out with no explanation. Our issue seems to be exactly the one described in these existing issues, and it is not new: https://github.com/actions/actions-runner-controller/issues/2695 https://github.com/actions/actions-runner-controller/discussions/2417
Describe the expected behavior
Workflow jobs not being randomly cancelled, and runner pods not receiving SIGTERM with no explanation.
Whole Controller Logs
Whole Runner Pod Logs
Additional Context
So while this post is highlighting the issue of the SIGTERM being sent, it is also highlighting a need for a new feature in GHA that I hope will be added eventually.