actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners

Controller has long delays in creating runner pods #3743

Closed: joshua-reddish closed 1 month ago

joshua-reddish commented 1 month ago

Checks

Controller Version

0.8.3

Deployment Method

Helm

Checks

To Reproduce

Our ARC system is installed using Helm, while our listeners are set up using ArgoCD. We are currently on version 0.8.3.
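For context, this is roughly how the controller side of the setup looks; the release name and namespace here are examples rather than our actual values, and the listener scale sets are applied separately through ArgoCD.

```shell
# Controller installed via Helm, pinned to the version we are running (0.8.3).
# Release name and namespace are examples only.
helm install arc \
  --namespace arc-systems \
  --create-namespace \
  --version 0.8.3 \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller
```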

This issue is intermittent. To reproduce, run a job in GitHub with an active listener. Expected behavior: a new runner pod is provisioned promptly. Actual behavior: it intermittently takes 10-15 minutes to provision a new runner pod.
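A minimal workflow like the sketch below is enough to trigger it; the `runs-on` label is an assumption based on the runner pod name shown later (`tas-8gb-...`).

```yaml
# Hypothetical minimal workflow to exercise the scale set.
# The runs-on value is an assumption, not necessarily our real label.
name: arc-repro
on: workflow_dispatch
jobs:
  noop:
    runs-on: tas-8gb
    steps:
      - run: echo "runner picked up the job"
```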

Describe the bug

There is a significant delay between when a job is kicked off in GitHub and when a new runner pod is provisioned.

Logs show the controller taking action, but each action seems to take about 5 minutes, leading to around 15 minutes before the pod is even created.

Describe the expected behavior

The controller's actions are taken in real time, instead of after a ~5 minute delay.

Additional Context

We are currently working to upgrade to the latest version of ARC, but this is causing a high volume of issues for our consumers. We are also investigating whether the controller is being resource limited.
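This is roughly how we are checking for resource pressure on the controller; the namespace and label selector are examples and may not match your install.

```shell
# Live CPU/memory usage of the controller pod (requires metrics-server).
kubectl top pod -n arc-systems -l app.kubernetes.io/name=gha-rs-controller

# Look for CPU throttling / OOM events and the configured requests and limits.
kubectl describe pod -n arc-systems -l app.kubernetes.io/name=gha-rs-controller
```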

It appears to only be happening in one of our clusters, which has a higher number of listeners and higher volume on those listeners.

Here is the timeline for the sample set of logs provided:

- 16:21: Job is kicked off in the UI, and the listener logs the message
- 16:26: First ARC controller logs
- 16:35: The new runner pod shows up: `tas-8gb-pzvbv-runner-vmdtl`
- 16:37: The pod is ready and starts the job
- 16:38: The job finishes and the pod enters the Completed state
- 16:45: The last log for this run appears in the controller logs
- 16:46: The pod is deleted

Only 3 minutes of this whole process are the actual compute spinning up and the job's steps running. We are seeing this across at least a dozen listeners and multiple controllers. This controller, `arc-asurion-private`, is the worst offender, and also has the most volume.

Controller Logs

https://gist.github.com/joshua-reddish/fe30772f5a1c84bc9c4d5f484b7afcc6

Runner Pod Logs

These are listener logs, as the runner itself runs fine.

https://gist.github.com/joshua-reddish/8763514433317f4ef7701aab496ba1e2
joshua-reddish commented 1 month ago

Also of note, our controller is using the default values for resources. I will look into specifying a larger reservation of CPU and memory to see if this is a resource constraint.
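Something like the values override below is what I have in mind, assuming the controller chart's `resources` value; the numbers are placeholders, not recommendations.

```yaml
# Hypothetical values override for the gha-runner-scale-set-controller chart.
# Sizing should come from observed usage, not these placeholder numbers.
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi
```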

joshua-reddish commented 1 month ago

One day into version 0.9.3 and we haven't seen the issue yet. Will provide another update in a few days as we see more volume.

I think we were experiencing a bit of a snowball effect leading to a backlog of items for the controller to churn through. The update seems to have helped it more efficiently push through the volume, at least so far.
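For anyone following along, the upgrade itself was just a chart version bump along these lines (release name and namespace are examples); the ArgoCD-managed scale sets were bumped to the same version, since the docs recommend keeping the controller and scale set chart versions in sync.

```shell
# Upgrade the controller chart to 0.9.3; release/namespace names are examples.
helm upgrade arc \
  --namespace arc-systems \
  --version 0.9.3 \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller
```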

joshua-reddish commented 1 month ago

We are now seeing some issues with requests being made to a broker server? The pods come up, but the pod logs themselves indicate issues connecting to a GitHub API. There is no pattern I can discern as to which labels/jobs have this issue.

Here is the gist: https://gist.github.com/joshua-reddish/c8c1fece3b78964e889b8d63be15b4fe

Can anyone assist with diagnosing the issue? It seems sporadic, but it is again causing pickup delays, and this time it's actually reserving the compute while spinning in circles. It eventually connects and the job proceeds, but it can take over 10 minutes.

joshua-reddish commented 1 month ago

Looks like the above may be related to GitHub server issues lol - https://www.githubstatus.com/incidents/69sb0f8lydp4