actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

I want to disable "Pod has failed to start more than 5 times" #2721

Open kahirokunn opened 1 year ago

kahirokunn commented 1 year ago

What would you like added?

Currently, once a runner fails to start five times, it is marked as failed. How about making this limit configurable to any number in the settings, with -1 meaning infinite?

Why is this needed?

I run the runner on a spot instance with minRunners set to 1. Each time the spot instance is terminated, the runner naturally terminates with it, and a failure is added to the EphemeralRunner. Because of this, I cannot keep the runner resident on a spot instance.

Additional context

https://github.com/actions/actions-runner-controller/blob/2bf83d0d7f3200b167955a66c2c3fd51e3e7e6e0/controllers/actions.github.com/ephemeralrunner_controller.go#L426

https://github.com/actions/actions-runner-controller/blob/2bf83d0d7f3200b167955a66c2c3fd51e3e7e6e0/controllers/actions.github.com/ephemeralrunner_controller.go#L192

https://github.com/actions/actions-runner-controller/issues/2722
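
For context, the behavior described above boils down to a hard-coded failure threshold in the controller. Below is a simplified Go sketch of that kind of check, paraphrased from the issue description rather than copied from the ARC source; the `Failures` field shape and the function name are assumptions for illustration:

```go
package main

import "fmt"

// EphemeralRunnerStatus is a simplified stand-in for ARC's status type;
// each entry in Failures records one failed runner pod.
type EphemeralRunnerStatus struct {
	Failures map[string]bool
}

// shouldMarkAsFailed mirrors the behavior described in the issue:
// once a runner pod has failed to start more than 5 times, the
// EphemeralRunner is marked as failed and no longer re-created.
func shouldMarkAsFailed(status EphemeralRunnerStatus) bool {
	return len(status.Failures) > 5
}

func main() {
	status := EphemeralRunnerStatus{Failures: map[string]bool{
		"pod-1": true, "pod-2": true, "pod-3": true,
		"pod-4": true, "pod-5": true, "pod-6": true,
	}}
	fmt.Println(shouldMarkAsFailed(status)) // true: 6 failures > 5
}
```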

github-actions[bot] commented 1 year ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

nikola-jokic commented 1 year ago

Hey @kahirokunn,

Thanks for submitting this issue! This makes sense to me :relaxed: The limit of 5 was chosen arbitrarily so that if the image is faulty, the controller does not re-create it indefinitely. But in my opinion, we should make this limit configurable.

kahirokunn commented 1 year ago

@nikola-jokic How about allowing the limit to be set to an arbitrary value, with -1 meaning infinite?
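
A hypothetical sketch of what that could look like on the controller side; the `-1` convention and the `failureLimitExceeded` helper are illustrative assumptions, not an existing ARC option:

```go
package main

import "fmt"

// failureLimitExceeded treats a negative limit (e.g. -1) as "retry forever"
// and any non-negative limit as the maximum number of tolerated failures.
func failureLimitExceeded(failureCount, limit int) bool {
	if limit < 0 {
		return false // -1: never give up re-creating the runner
	}
	return failureCount > limit
}

func main() {
	fmt.Println(failureLimitExceeded(6, 5))  // true: behaves like today's default
	fmt.Println(failureLimitExceeded(6, -1)) // false: infinite retries
}
```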

nikola-jokic commented 1 year ago

Hey @kahirokunn,

We decided not to implement infinite retries. The reason is that minRunners is meant to ensure that some runners are kept "warm", while spot instances are in a sense the opposite. This mechanism exists so that a faulty configuration can be spotted, and so that the controller does not issue many requests to the Actions service when the error is not recoverable without manual intervention.

I will close this issue now but feel free to comment on it :relaxed:

kahirokunn commented 1 year ago

@nikola-jokic Regarding documentation, then: running minRunners pods on spot instances is not recommended, since it leads to the "has failed to start more than 5 times" error. How about adding a document covering this error? I am happy to write it myself if you wish.

apc-jnytnai0613 commented 1 year ago

@nikola-jokic Instead of an infinite number of retries, why not allow the user to specify the number of retries? I can implement this.

nikola-jokic commented 1 year ago

That is a good suggestion @kahirokunn, we should probably include it in the docs or FAQs and in the values.yaml file

@apc-jnytnai0613 if we implemented a user-specified number of retries, it would effectively be the same as having infinite retries: specify a sufficiently large number and it acts as an almost-infinite retry until you figure out that something is not working as expected.

Having a limited number of retries also signals that something is wrong and should probably be fixed. The 5 is not a magic number here; it was chosen to be large enough that if something still does not work after 5 attempts, it is likely a genuine error. The number is not set in stone and is subject to change, but so far the only issue with it is running on spot instances, which ARC is not designed for.

romancin commented 1 year ago

Hi everyone! We are experiencing the same problem. We have configured minRunners to 1 so we don't have to wait for node provisioning when a new workflow is triggered, which adds about 4 minutes to our job time. When Google stops our instance and starts a new one, the EphemeralRunner gets stuck in a failed state. We must delete it manually to get our workflows working again...

Why not use spot instances for this? I think CI workflows are a perfect use case for them... Spot instances are 60% of the cost of a standard one, which is ideal for workflows that aren't always running. IMHO, I don't see a problem with making the failure limit configurable. Setting it to a higher (non-infinite) value would solve this problem.

nikola-jokic commented 1 year ago

Hey @romancin,

I completely agree that using spot instances is a good choice. But the idea of spot instances is somewhat contrary to specifying minRunners. Spot instances are picked to save money when there is no activity on your scale set, while minRunners is used to speed up the startup time of workflows when the scale set is inactive (because you have the number of min runners ready to receive a job and thus don't have to wait for the runner to be up). Having said that, it is not obvious what the correct value should be. It would highly depend on the times when workflows are scheduled. In the worst case, it would effectively be the same as having minRunners: 0.

One of the benefits of scale sets is also that they can avoid rate limiting. Having an infinite or large number of restarts in these scenarios can unnecessarily increase the load on the server without any performance benefit.

chlunde commented 11 months ago

@nikola-jokic This not only affects spot instances but any kind of interruption or eviction, such as node upgrades or consolidation.

Another issue here is that there's no defined time period for the five failures.

For example, over a weekend with no jobs in a git repository, if an ephemeral runner is unlucky enough to be rescheduled five times, GitHub Actions practically stops working for that EphemeralRunnerSet and requires manual intervention.

In general, most everything else in Kubernetes is self-healing, so removing this limit or adding time/rate limiting would be more in line with behavior elsewhere in Kubernetes. For example, we have many Deployment/ReplicaSet instances affected by the same node scale-downs, and we don't need any manual intervention there.

Idea for improvement: keep timestamps (instead of "true") when tracking failures, and garbage-collect failures more than 15 minutes old when appending to the array. Getting stuck would then require 5 failures within 15 minutes. A sketch of this follows below.
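
A minimal Go sketch of that idea; the 15-minute window, the timestamp-slice representation, and the function names are assumptions for illustration, not ARC's actual data model:

```go
package main

import (
	"fmt"
	"time"
)

const (
	failureWindow = 15 * time.Minute // only failures this recent count
	maxFailures   = 5                // same threshold as today
)

// recordFailure drops failures that fell outside the window, then appends
// the new one, so stale failures from days ago no longer count.
func recordFailure(failures []time.Time, now time.Time) []time.Time {
	recent := failures[:0]
	for _, t := range failures {
		if now.Sub(t) <= failureWindow {
			recent = append(recent, t)
		}
	}
	return append(recent, now)
}

// shouldMarkAsFailed is true only when the failure count within the
// window exceeds maxFailures, rather than counting failures spread
// over a whole weekend.
func shouldMarkAsFailed(failures []time.Time) bool {
	return len(failures) > maxFailures
}

func main() {
	now := time.Now()
	// Five old failures (all outside the window) plus one fresh failure.
	failures := []time.Time{
		now.Add(-3 * time.Hour), now.Add(-2 * time.Hour),
		now.Add(-90 * time.Minute), now.Add(-60 * time.Minute),
		now.Add(-40 * time.Minute),
	}
	failures = recordFailure(failures, now)
	fmt.Println(shouldMarkAsFailed(failures)) // false: only 1 recent failure
}
```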

kahirokunn commented 11 months ago

@chlunde I consider your idea good.

kahirokunn commented 10 months ago

/reopen

gavriellan-ma commented 4 months ago

@chlunde I consider your idea good.

+1