victoruzunovjuro opened this issue 1 year ago
Hello! Thank you for filing an issue.
The maintainers will triage your issue shortly.
In the meantime, please take a look at the troubleshooting guide for bug reports.
If this is a feature request, please review our contribution guidelines.
I'm facing the same issue :/
I'm also facing the same problem
same here
> Related to this upstream issue?
I am facing this exact problem. There's no telling exactly when it happens; I have seen it on long-running jobs (~1 hour). Any update on a fix, please? It is related to this issue: https://github.com/actions/actions-runner-controller/issues/2882
Related to this upstream issue?
* [The linux runner does not gracefully shutdown on SIGINT runner#2582](https://github.com/actions/runner/issues/2582)
This doesn't seem entirely relevant. We are talking about the controller sending an API call to Kubernetes to kill the runner pod for whatever reason.
I am facing a similar issue. Exactly one hour in, the runner gets terminated even though the job is still running, with the error below:

> The self-hosted runner: xxxxxx lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.
I am facing the same issue. The runner pod dies randomly, making the UI hang for 10 minutes before the workflow fails.
> Related to this upstream issue?
That issue and its possible fix only add a graceful shutdown: instead of hanging for 10 minutes, the runner gets enough time to exit cleanly and report the run as cancelled. I updated my runner to add this (sketch below) and manually deleted the pod, and the job showed as cancelled rather than hanging; so, as @victoruzunovjuro said, it isn't related to the root cause.
I also verified that the pod isn't being scaled down by cluster-autoscaler and that it isn't a spot instance termination. Unfortunately, I can't confirm that the pod is really being terminated by actions-runner-controller itself, since I don't have access to the audit logs.
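For anyone who wants to try the same mitigation, here is a minimal sketch of what I mean, assuming the legacy RunnerDeployment API and the `RUNNER_GRACEFUL_STOP_TIMEOUT` variable described in the ARC graceful-termination docs; the name, repository, and timeouts are illustrative, not my actual manifest:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: github-runners-spot-large   # illustrative name
spec:
  template:
    spec:
      repository: example-org/example-repo   # illustrative repo
      # Give the kubelet enough time before it SIGKILLs the pod,
      # so the runner can finish or cancel the job cleanly.
      terminationGracePeriodSeconds: 110
      env:
        # Assumption: ARC's graceful-stop support; the runner waits up to
        # this many seconds for the job to wrap up before exiting.
        - name: RUNNER_GRACEFUL_STOP_TIMEOUT
          value: "90"
```

The grace period should be a bit longer than the stop timeout, so the runner's own shutdown always wins the race against the kubelet's SIGKILL.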
This started happening for me after upgrading the runner image to 2.316.0; rolling back to 2.314.1 might work around it until they release a fix.
@victoruzunovjuro does the lifecycle hook work? I'm trying to make the preStop hook work, but it doesn't run at all. The runner just gets killed with no lifecycle hook execution; I even tried with sidecar containers, same issue. Could it be that SIGTERM is never received?
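For reference, this is roughly the shape of the preStop hook I'm testing — a plain Kubernetes pod-spec sketch, assuming the runner container is named `runner`; the image tag and script are illustrative, and the touch/sleep is only there to observe whether the hook fires at all:

```yaml
# Pod spec fragment for the runner container (illustrative).
containers:
  - name: runner
    image: summerwind/actions-runner:v2.316.0
    lifecycle:
      preStop:
        exec:
          # If this never runs, the pod is likely being force-deleted
          # (grace period 0), which would also explain the missing SIGTERM.
          command: ["/bin/sh", "-c", "touch /tmp/prestop-ran && sleep 30"]
```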
Checks
Controller Version
0.27.3
Helm Chart Version
0.23.2
CertManager Version
v1.11.1
Deployment Method
Helm
cert-manager installation
Yes.
Checks
Resource Definitions
To Reproduce
Describe the bug
We are using ARC to configure runners on Bottlerocket spot instances in EKS, with the webhook-based autoscaling approach (a minimal sketch of that setup is included at the end of this description).
We have noticed that ARC will sporadically delete pods while they are in use. Steps we've taken to troubleshoot this:
Below are the logs of one such case:
github-runners-spot-large-sw7cb-lx6sc
Here is a Kubernetes API event at 2023-05-03T12:04:29.433+03:00 showing that ARC deleted the pod:

Describe the expected behavior
ARC is expected not to delete pods while they are still running a job.
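For context, the webhook-based autoscaling mentioned above is configured along these lines — a minimal sketch, not our actual manifest; the names, replica counts, and duration are illustrative:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: github-runners-spot-large   # illustrative name
spec:
  scaleTargetRef:
    name: github-runners-spot-large   # the RunnerDeployment to scale
  minReplicas: 0
  maxReplicas: 20
  # Scale up on workflow_job webhook events instead of polling.
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      duration: "30m"   # how long the added capacity is kept
```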
Whole Controller Logs