ARC should handle OOM killed runners

antoineozenne-at-leocare commented 4 months ago

Checks

[X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
[X] I am using charts that are officially provided

Controller Version

0.8.0

Deployment Method

Helm

Checks

[x] This isn't a question or user support case (For Q&A and community support, go to Discussions).
[X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Deploy a release of `gha-runner-scale-set` with an `ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE` to customize the resource requests and limits of the runner.
2. Run a job in GitHub that gets this runner OOMKilled.

Describe the bug

When the runner is OOMKilled, nothing happens and the pod stays in OOMKilled status. The controller doesn't seem to handle this case, and the job eventually times out.

Describe the expected behavior

I think ARC should handle the case where the runner is OOMKilled by stopping the job in GitHub with an error status.

Additional Context

kubectl get pods -n arc-runners
# NAME                                                           READY   STATUS      RESTARTS   AGE
# arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m            1/1     Running     0          13h
# arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m-workflow   0/1     OOMKilled   0          136m
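
For anyone who wants to detect this state from outside the controller, here is a minimal sketch in Python. It assumes the pod object has been fetched as JSON (for example with `kubectl get pod <name> -n arc-runners -o json`) and looks for containers whose current state is `terminated` with reason `OOMKilled`. The function name, the sample pod, and the container name `job` are illustrative assumptions. This is not how ARC or the hook handles the case.

```python
def oom_killed_containers(pod: dict) -> list[str]:
    """Return the names of containers currently terminated with reason OOMKilled."""
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # "state" holds exactly one of: waiting, running, terminated.
        terminated = status.get("state", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            names.append(status["name"])
    return names


# Trimmed, hypothetical pod object shaped like `kubectl get pod -o json` output.
sample_pod = {
    "metadata": {"name": "example-runner-workflow", "namespace": "arc-runners"},
    "status": {
        "phase": "Failed",
        "containerStatuses": [
            {
                "name": "job",  # assumed container name, for illustration only
                "restartCount": 0,
                "state": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            }
        ],
    },
}

print(oom_killed_containers(sample_pod))  # ['job']
```

A check like this only reports the condition. Failing the job in GitHub would still have to be done by the hook or the runner.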

Controller Logs

2024-03-04T00:23:29Z    INFO    EphemeralRunnerSet  Created new ephemeral runner    {"ephemeralrunnerset": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l","namespace":"arc-runners"}, "runner": "arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m"}
2024-03-04T00:23:29Z    INFO    EphemeralRunner Adding runner registration finalizer    {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:29Z    INFO    EphemeralRunner Successfully added runner registration finalizer    {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:29Z    INFO    EphemeralRunner Adding finalizer    {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:29Z    INFO    EphemeralRunner Successfully added finalizer    {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:29Z    INFO    EphemeralRunner Adding finalizer    {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:29Z    INFO    EphemeralRunner Successfully added finalizer    {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:29Z    INFO    EphemeralRunner Creating new ephemeral runner registration and updating status with runner config   {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:29Z    INFO    EphemeralRunner Creating ephemeral runner JIT config    {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:31Z    INFO    EphemeralRunner Created ephemeral runner JIT config {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}, "runnerId": 5715}
2024-03-04T00:23:31Z    INFO    EphemeralRunner Updating ephemeral runner status with runnerId and runnerJITConfig  {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:31Z    INFO    EphemeralRunner Updated ephemeral runner status with runnerId and runnerJITConfig   {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:31Z    INFO    EphemeralRunner Creating new ephemeral runner secret for jitconfig. {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:31Z    INFO    EphemeralRunner Creating new secret for ephemeral runner    {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:31Z    INFO    EphemeralRunner Created new secret spec for ephemeral runner    {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:31Z    INFO    EphemeralRunner Created ephemeral runner secret {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}, "secretName": "arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m"}
2024-03-04T00:23:31Z    INFO    EphemeralRunner Creating new EphemeralRunner pod.   {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:31Z    INFO    EphemeralRunner Creating new pod for ephemeral runner   {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:31Z    INFO    EphemeralRunner Created new pod spec for ephemeral runner   {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:31Z    INFO    EphemeralRunner Created ephemeral runner pod    {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}, "runnerScaleSetId": 9, "runnerName": "arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m", "runnerId": 5715, "configUrl": "https://github.com/XXX", "podName": "arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m"}
2024-03-04T00:23:31Z    INFO    EphemeralRunner Waiting for runner container status to be available {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:31Z    INFO    EphemeralRunner Waiting for runner container status to be available {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:59Z    INFO    EphemeralRunner Waiting for runner container status to be available {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:59Z    INFO    EphemeralRunner Ephemeral runner container is still running {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:59Z    INFO    EphemeralRunner Updating ephemeral runner status with pod phase {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}, "phase": "Pending", "reason": "", "message": ""}
2024-03-04T00:23:59Z    INFO    EphemeralRunner Updated ephemeral runner status with pod phase  {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:59Z    INFO    EphemeralRunner Ephemeral runner container is still running {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:24:13Z    INFO    EphemeralRunner Ephemeral runner container is still running {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:24:13Z    INFO    EphemeralRunner Updating ephemeral runner status with pod phase {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}, "phase": "Running", "reason": "", "message": ""}
2024-03-04T00:24:13Z    INFO    EphemeralRunner Updated ephemeral runner status with pod phase  {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:24:13Z    INFO    EphemeralRunner Ephemeral runner container is still running {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T11:27:43Z    INFO    EphemeralRunner Ephemeral runner container is still running {"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}

Runner Pod Logs

...
[WORKER 2024-03-04 13:49:04Z INFO HostContext] Well known directory 'Bin': '/home/runner/bin'
[WORKER 2024-03-04 13:49:04Z INFO HostContext] Well known directory 'Root': '/home/runner'
[WORKER 2024-03-04 13:49:04Z INFO HostContext] Well known directory 'Work': '/home/runner/_work'
[RUNNER 2024-03-04 13:49:14Z INFO JobDispatcher] Successfully renew job request 93068, job is valid till 03/04/2024 13:59:14
[WORKER 2024-03-04 13:49:14Z INFO HostContext] Well known directory 'Bin': '/home/runner/bin'
[WORKER 2024-03-04 13:49:14Z INFO HostContext] Well known directory 'Root': '/home/runner'
[WORKER 2024-03-04 13:49:14Z INFO HostContext] Well known directory 'Work': '/home/runner/_work'
...

github-actions[bot] commented 4 months ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

nikola-jokic commented 4 months ago

Moved issue to hooks, since the hook should be responsible for maintaining resources that it creates :relaxed:

halradaideh commented 3 weeks ago

This happened here as well: the runner went OOM and the workflow just froze up.

nikola-jokic commented 3 weeks ago

Hey everyone,

The main problem is that we do not use the scheduler to schedule pods. The reason is that we need workflow pods to land on the same machine where the runner is. There is an option to use the kube scheduler; however, it requires a ReadWriteMany volume. Under these constraints, we can't do anything else. There is a great PR and a suggestion on how to work around the issue. Hopefully, we can dedicate time in the future to test it and double-check that it works. However, the OOM kill is raised by k8s, so the best thing you can do at this time is to ensure your nodes can handle the load, or use a ReadWriteMany volume to allow workflow pods to be scheduled on different nodes.

halradaideh commented 3 weeks ago

I think I am facing a different issue.

I am using ARC with the dind template as explained in the documentation. The pod resources created by the scale set are limited, for example one core and 2 GB of RAM. The issue happens when the workflow requests more resources than the scale set runner has defined, which causes Kubernetes to kill the pod.

Instead of getting something useful in the action logs, such as an OOM status, Kubernetes reschedules the pod, and it gets stuck at waiting IPV from dispatcher. At the same time, the action logs get stuck,

and I have to force-kill the action.
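
If the container is restarted after the kill, the pod can show `Running` again and the OOM is easy to miss. As a rough sketch (assuming the pod JSON comes from `kubectl get pod <name> -o json`), the previous termination is recorded under `lastState` in each container status, so it can be checked like this. The function name, the sample data, and the container names `runner` and `dind` are assumptions for illustration.

```python
def oom_restarts(pod: dict) -> list[tuple[str, int]]:
    """Return (container name, restart count) for containers whose previous run was OOMKilled."""
    hits = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # "lastState" describes the previous run of the container, if there was one.
        last = status.get("lastState", {}).get("terminated") or {}
        if last.get("reason") == "OOMKilled":
            hits.append((status["name"], status.get("restartCount", 0)))
    return hits


# Trimmed, hypothetical pod status: the runner container is running again after an OOM kill.
restarted_pod = {
    "status": {
        "phase": "Running",
        "containerStatuses": [
            {
                "name": "runner",  # assumed container name
                "restartCount": 1,
                "state": {"running": {}},
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            },
            {
                "name": "dind",  # assumed container name
                "restartCount": 0,
                "state": {"running": {}},
                "lastState": {},
            },
        ],
    }
}

print(oom_restarts(restarted_pod))  # [('runner', 1)]
```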
