Closed joshua-reddish closed 1 month ago
Also of note, our controller is using the default values for resources. I will look into specifying a larger CPU and memory reservation to see whether this is a resource constraint.
One day into version 0.9.3 and we haven't seen the issue yet. Will provide another update in a few days as we see more volume.
I think we were experiencing a bit of a snowball effect leading to a backlog of items for the controller to churn through. The update seems to have helped it more efficiently push through the volume, at least so far.
We are now seeing what appears to be an issue with a request being made to a broker server. The pods come up, but the pod logs themselves indicate issues connecting to a GitHub API. I can't discern any pattern as to which labels/jobs have this issue.
Here is the gist: https://gist.github.com/joshua-reddish/c8c1fece3b78964e889b8d63be15b4fe
Can anyone assist with diagnosing the issue? It seems sporadic, but it is again causing pickup delays, and this time it's actually reserving the compute while spinning in circles. It eventually connects and the job proceeds, but it can take over 10 minutes.
Looks like the above may be related to GitHub server issues lol - https://www.githubstatus.com/incidents/69sb0f8lydp4
Checks
Controller Version
0.8.3
Deployment Method
Helm
Checks
To Reproduce
Describe the bug
There is a significant delay between when a job is kicked off in GitHub and when a new runner pod is provisioned.
Logs show that the controller is taking action, but each action seems to take about 5 minutes to happen, leading to around 15 minutes before the pod is even created.
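One way to quantify the delay is to measure the gaps between consecutive controller log lines. This is a minimal sketch, assuming each log line begins with an ISO-8601 timestamp; the function names and the sample lines are hypothetical and are not taken from the actual controller logs.

```python
from datetime import datetime, timedelta


def parse_ts(line: str) -> datetime:
    """Parse the leading timestamp of a log line (assumed ISO-8601 format)."""
    token = line.split(maxsplit=1)[0]
    # Older Python versions do not accept a trailing "Z" in fromisoformat().
    return datetime.fromisoformat(token.replace("Z", "+00:00"))


def find_gaps(lines, threshold=timedelta(minutes=1)):
    """Return (gap, previous_line, next_line) for consecutive lines
    whose timestamps are further apart than `threshold`."""
    stamped = [(parse_ts(l), l) for l in lines if l.strip()]
    gaps = []
    for (t0, l0), (t1, l1) in zip(stamped, stamped[1:]):
        if t1 - t0 > threshold:
            gaps.append((t1 - t0, l0, l1))
    return gaps


# Hypothetical sample lines, for illustration only.
sample = [
    "2024-01-01T10:00:00Z INFO step A",
    "2024-01-01T10:05:02Z INFO step B",
    "2024-01-01T10:05:03Z INFO step C",
    "2024-01-01T10:10:05Z INFO step D",
]

for gap, before, after in find_gaps(sample):
    print(gap, "|", before, "->", after)
```

Running this against the real controller log (one line per list entry) should show which steps the ~5 minute pauses fall between.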
Describe the expected behavior
The controller's actions are taken in real time, instead of after a ~5 minute delay.
Additional Context
Controller Logs
Runner Pod Logs