oeuftete opened this issue 1 year ago (status: Open)
Hello! Thank you for filing an issue.
The maintainers will triage your issue shortly.
In the meantime, please take a look at the troubleshooting guide for bug reports.
If this is a feature request, please review our contribution guidelines.
@oeuftete Hey. Could you share runner pod logs and a few `kubectl describe pod` outputs from ill-behaved runner pods? Those are crucial for diagnosing this further, as they might tell us why the runner pods are stuck in `Terminating`.
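For anyone hitting the same symptom, here is a sketch of how the requested diagnostics could be gathered in one pass. The `actions-runner-system` namespace matches the logs below; the `awk`-based status filter is my own convenience, not anything ARC provides.

```shell
#!/bin/sh
# Collect `kubectl describe` output and container logs for every runner
# pod stuck in Terminating. Adjust NS for your installation.
NS="actions-runner-system"

collect_stuck_pod_diagnostics() {
  # Column 3 of `kubectl get pods` is STATUS; Terminating pods show there.
  for pod in $(kubectl get pods -n "$NS" --no-headers | awk '$3 == "Terminating" {print $1}'); do
    kubectl describe pod "$pod" -n "$NS" > "describe-$pod.txt"
    kubectl logs "$pod" -n "$NS" --all-containers=true > "logs-$pod.txt" 2>&1
  done
}

# Only run when kubectl is actually on PATH.
if command -v kubectl >/dev/null 2>&1; then
  collect_stuck_pod_diagnostics
fi
```

The resulting `describe-*.txt` files are what will usually show the finalizer or container state keeping a pod in `Terminating`.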
@mumoshu I'll have to wait until it happens again, though I expect it will soon.
I'm not really concerned about why the pods are stuck in the context of this issue, though. The issue I wanted to raise here was that having a stuck pod seems to prevent downscaling of runners that are healthy `Running` but idle. Once I remove the stuck `Terminating` pods, the downscaling of the idle `Running` pods happens more or less immediately.
@oeuftete Thanks for your confirmation! I guess it was just a coincidence. `PercentageRunnersBusy` works solely based on responses from some GitHub Actions API calls, which might be cached and hence delayed by approximately 60 seconds in reflecting the actual state of the runners. ARC doesn't consider pod statuses as far as the calculation of the desired replicas is concerned. We may be able to tell whether it's actually a coincidence if you could provide the additional logs I asked for.
@mumoshu I've added new logs (including the pod logs, as exported from Datadog) now in the edited summary. I patched the finalizer on the one pod stuck in `Terminating` at ~16:24:15 UTC. You can see the suggested desired replicas drop rapidly from 3 to 0 once the single stuck `Terminating` pod is cleaned up.

Edit: added a gist for the single stuck pod's `describe` output: https://gist.github.com/oeuftete/b3ae28123d69330638e04aedd5ef6039
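For reference, the finalizer patch described above can be done with `kubectl patch`. The pod name below is a placeholder, and clearing finalizers should be a last resort: it skips whatever cleanup the finalizer was guarding.

```shell
#!/bin/sh
# Merge patch that empties the pod's finalizer list so a stuck
# Terminating pod can actually be deleted. POD is a placeholder.
NS="actions-runner-system"
POD="replace-with-stuck-runner-pod-name"
PATCH='{"metadata":{"finalizers":null}}'

# Guarded so the sketch is safe to source without a cluster.
if command -v kubectl >/dev/null 2>&1; then
  kubectl patch pod "$POD" -n "$NS" --type merge -p "$PATCH"
fi
```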
```
❯ grep Suggested /tmp/arc.log | tail -5
2022-12-08T16:19:19Z DEBUG actions-runner-controller.horizontalrunnerautoscaler Suggested desired replicas of 3 by PercentageRunnersBusy {"replicas_desired_before": 3, "replicas_desired": 3, "num_runners": 3, "num_runners_registered": 3, "num_runners_busy": 0, "num_terminating_busy": 1, "namespace": "actions-runner-system", "kind": "runnerdeployment", "name": "evi-platform-study-dev-runner", "horizontal_runner_autoscaler": "evi-platform-study-dev-runner", "enterprise": "evidation-health", "organization": "", "repository": ""}
2022-12-08T16:24:07Z DEBUG actions-runner-controller.horizontalrunnerautoscaler Suggested desired replicas of 3 by PercentageRunnersBusy {"replicas_desired_before": 3, "replicas_desired": 3, "num_runners": 3, "num_runners_registered": 3, "num_runners_busy": 0, "num_terminating_busy": 1, "namespace": "actions-runner-system", "kind": "runnerdeployment", "name": "evi-platform-study-dev-runner", "horizontal_runner_autoscaler": "evi-platform-study-dev-runner", "enterprise": "evidation-health", "organization": "", "repository": ""}
2022-12-08T16:28:56Z DEBUG actions-runner-controller.horizontalrunnerautoscaler Suggested desired replicas of 2 by PercentageRunnersBusy {"replicas_desired_before": 3, "replicas_desired": 2, "num_runners": 3, "num_runners_registered": 3, "num_runners_busy": 0, "num_terminating_busy": 0, "namespace": "actions-runner-system", "kind": "runnerdeployment", "name": "evi-platform-study-dev-runner", "horizontal_runner_autoscaler": "evi-platform-study-dev-runner", "enterprise": "evidation-health", "organization": "", "repository": ""}
2022-12-08T16:28:56Z DEBUG actions-runner-controller.horizontalrunnerautoscaler Suggested desired replicas of 1 by PercentageRunnersBusy {"replicas_desired_before": 2, "replicas_desired": 1, "num_runners": 3, "num_runners_registered": 3, "num_runners_busy": 0, "num_terminating_busy": 0, "namespace": "actions-runner-system", "kind": "runnerdeployment", "name": "evi-platform-study-dev-runner", "horizontal_runner_autoscaler": "evi-platform-study-dev-runner", "enterprise": "evidation-health", "organization": "", "repository": ""}
2022-12-08T16:28:56Z DEBUG actions-runner-controller.horizontalrunnerautoscaler Suggested desired replicas of 0 by PercentageRunnersBusy {"replicas_desired_before": 1, "replicas_desired": 0, "num_runners": 3, "num_runners_registered": 3, "num_runners_busy": 0, "num_terminating_busy": 0, "namespace": "actions-runner-system", "kind": "runnerdeployment", "name": "evi-platform-study-dev-runner", "horizontal_runner_autoscaler": "evi-platform-study-dev-runner", "enterprise": "evidation-health", "organization": "", "repository": ""}
```
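The fields in those structured log lines can be pulled out with plain `sed`, which is handy for watching `num_terminating_busy` flip back to 0. The sample line below is abbreviated from the logs above.

```shell
#!/bin/sh
# Extract num_terminating_busy from an HRA "Suggested desired replicas"
# log line; the sample is abbreviated from the controller logs above.
line='2022-12-08T16:24:07Z DEBUG actions-runner-controller.horizontalrunnerautoscaler Suggested desired replicas of 3 by PercentageRunnersBusy {"replicas_desired": 3, "num_runners": 3, "num_runners_busy": 0, "num_terminating_busy": 1}'

# The sed pattern captures the first run of digits after the key.
terminating_busy=$(printf '%s\n' "$line" | sed -n 's/.*"num_terminating_busy": *\([0-9][0-9]*\).*/\1/p')
echo "num_terminating_busy=$terminating_busy"   # prints num_terminating_busy=1
```

Piping `grep Suggested /tmp/arc.log` through the same `sed` expression gives the counter over time.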
I'm facing this issue also
@mar-pan Hey! Are you still using `PercentageRunnersBusy`? Our recommended autoscaling solution is either the webhook-based one or the new RunnerScaleSet, which is currently in beta testing.
Checks
Controller Version
0.26.0
Helm Chart Version
0.21.0
CertManager Version
1.8.0
Deployment Method
Helm
cert-manager installation
✅
Checks
Resource Definitions
To Reproduce
Describe the bug
Although there were no non-terminating busy runners, desired replicas remained at 5. Once the `Terminating` pods were removed by removing their finalizers, scaledown to the reserved limit occurred in the next cycle.

Describe the expected behavior

Even with stuck `Terminating` pods, idle runners are scaled down.

Whole Controller Logs
Whole Runner Pod Logs
Additional Context
No response