antoineozenne-at-leocare opened 4 months ago
Hello! Thank you for filing an issue.
The maintainers will triage your issue shortly.
In the meantime, please take a look at the troubleshooting guide for bug reports.
If this is a feature request, please review our contribution guidelines.
Moved issue to hooks, since the hook should be responsible for maintaining resources that it creates :relaxed:
Happened here as well: the runner went OOM and the workflow just froze up.
Hey everyone,
The main problem is that we do not use the Kubernetes scheduler to place workflow pods. The reason is that we need workflow pods to land on the same node where the runner is. There is an option to use the kube scheduler; however, it requires a ReadWriteMany volume.
Under these constraints, we can't do anything else. There is a great PR and a suggestion on how to work around the issue; hopefully we can dedicate time in the future to test it and double-check that it works. However, OOMKilled is raised by Kubernetes itself, so the best thing you can do at this time is to ensure your nodes can handle the load, or use a ReadWriteMany volume to allow workflow pods to be scheduled on different nodes.
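For anyone who wants to try the workaround, here is a minimal, untested sketch of gha-runner-scale-set Helm values, assuming the container hook's `ACTIONS_RUNNER_USE_KUBE_SCHEDULER` option and a storage class that supports ReadWriteMany (the storage class name here is hypothetical):

```yaml
# values.yaml for the gha-runner-scale-set chart (sketch, not tested)
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    # ReadWriteMany lets workflow pods land on nodes other than the runner's
    accessModes: ["ReadWriteMany"]
    storageClassName: "my-rwx-storage-class"  # hypothetical RWX-capable class
    resources:
      requests:
        storage: 1Gi
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          # let the kube scheduler place workflow pods instead of
          # pinning them to the runner's node
          - name: ACTIONS_RUNNER_USE_KUBE_SCHEDULER
            value: "true"
```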
I think I am facing a different issue.
I am using ARC with the dind template as explained in the documentation. The pod resources created by the scale set are limited, e.g. one core and 2 GB of RAM. The issue happens when the workflow requests more resources than the scale set runner has defined, which causes Kubernetes to kill the pod.
Instead of getting something useful in the Actions logs, like an OOM status, Kubernetes reschedules the pod, it gets stuck at "waiting IPV from dispatcher", and the Actions logs hang at the same time.
I have to force-kill the action.
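For context, the limits in question come from the runner container spec in the scale set values. A minimal sketch, assuming the dind container mode from the ARC docs (the numbers are illustrative):

```yaml
# values.yaml for the gha-runner-scale-set chart (sketch; numbers illustrative)
containerMode:
  type: "dind"
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        resources:
          requests:
            cpu: "1"
            memory: 2Gi
          limits:
            # the kernel OOM-kills the container once it exceeds this limit
            memory: 2Gi
```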
Checks
Controller Version
0.8.0
Deployment Method
Helm
Checks
To Reproduce
Describe the bug
When the runner is OOMKilled, nothing happens and the pod stays in the OOMKilled status. The controller doesn't seem to handle this case, and the job eventually times out.
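For illustration, a minimal workflow sketch that should reproduce this on a runner pod that has a memory limit set (the scale set label is hypothetical):

```yaml
# .github/workflows/oom-repro.yml (sketch)
name: oom-repro
on: workflow_dispatch
jobs:
  oom:
    runs-on: my-arc-runner-set  # hypothetical scale set name
    steps:
      - name: Exceed the runner pod's memory limit
        # tail keeps buffering /dev/zero until the cgroup memory limit
        # is hit, at which point Kubernetes marks the container OOMKilled
        run: tail /dev/zero
```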
Describe the expected behavior
I think ARC should handle the case where the runner is OOMKilled by stopping the job in GitHub with an error status.
Additional Context
Controller Logs
Runner Pod Logs