actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners

Ephemeral runner gets stuck in Successful state #3527

Closed katarzynainit closed 6 months ago

katarzynainit commented 6 months ago

Controller Version

0.9.0

Deployment Method

Helm

To Reproduce

1. I have ARC installed on GKE, with the controller and the runner sets in the same namespace.
2. In most cases everything works as expected.
3. From time to time (randomly) we observe the following behavior:
- the ephemeral runner set gets patched, e.g. from 0 desired replicas to 1
- an ephemeral runner is created, but its status immediately changes to Succeeded and nothing happens; the workload is "stuck" waiting for a runner

Describe the bug

In the controller logs I see that it already "Found the runner with the same name". It looks like the controller reconciles the same ephemeralrunner twice at almost the same time, and the second run "removes" the runner and leaves it hung.

In the end the runner is never created, and the ephemeral runner reaches the Succeeded phase and stays stuck there until the workflow is cancelled.

We started to observe this behavior when we moved to a faster cluster.
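
To make the suspected sequence concrete, here is a minimal Go sketch. It is not ARC code and does not confirm the root cause: the `registry` type, `naiveReconcile`, `guardedReconcile`, and the `uid-1` owner value are all hypothetical names invented for illustration. It only shows how two back-to-back reconciles of the same object can cancel each other out when "a runner with the same name exists" is treated as a stale registration, and how comparing the owner would make the second pass a no-op.

```go
package main

import "fmt"

// registry is a hypothetical stand-in for the service-side list of
// registered runners, keyed by runner name and storing the owner UID.
type registry struct {
	owners map[string]string
}

func newRegistry() *registry {
	return &registry{owners: make(map[string]string)}
}

// naiveReconcile models the suspected behavior: any existing runner with
// the same name is treated as stale and removed, without re-registering.
func (r *registry) naiveReconcile(name, uid string) {
	if _, found := r.owners[name]; found {
		delete(r.owners, name) // "Found the runner with the same name"
		return
	}
	r.owners[name] = uid
}

// guardedReconcile is idempotent: if the existing runner belongs to the
// same owner UID, it is kept; otherwise it is (re)registered for this owner.
func (r *registry) guardedReconcile(name, uid string) {
	if owner, found := r.owners[name]; found && owner == uid {
		return // already registered by this object, nothing to do
	}
	r.owners[name] = uid
}

// registered reports whether a runner with the given name exists.
func (r *registry) registered(name string) bool {
	_, found := r.owners[name]
	return found
}

func main() {
	naive, guarded := newRegistry(), newRegistry()
	// Simulate two reconciles of the same ephemeral runner in quick succession.
	for i := 0; i < 2; i++ {
		naive.naiveReconcile("runner-a", "uid-1")
		guarded.guardedReconcile("runner-a", "uid-1")
	}
	fmt.Println("naive registered:", naive.registered("runner-a"))     // false
	fmt.Println("guarded registered:", guarded.registered("runner-a")) // true
}
```

In the naive variant the second pass deletes the registration made by the first, which matches the symptom of a runner that never appears. Whether the real controller follows this path is an assumption based only on the log line quoted above.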

Describe the expected behavior

The controller should always create a runner when an ephemeral runner is created.

Additional Context

N/A

Controller Logs

https://gist.github.com/katarzynainit/ceccccde10d5454aa104d0f5a98f9b0d

Runner Pod Logs

N/A

github-actions[bot] commented 6 months ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

nikola-jokic commented 6 months ago

Hey @katarzynainit,

Can you please share the controller values.yaml file so I can try to reproduce this issue?

katarzynainit commented 6 months ago

Hi, we are using a forked arc-controller. The code changes only skip the creation of the controller and listener SA and RBAC resources, based on three flags. The code that processes ephemeral runners is unchanged from 0.9.0.

https://gist.github.com/katarzynainit/d9e6ed4d3c6b95e929d73e2b1e8f7cc1 (flags for internal changes are marked in the values)

We started to observe this issue on a faster cluster. We did not see it before with the same configuration on a different, slower cluster.

It also happens only from time to time, so it might be difficult to observe.