Closed: wasd171 closed this 4 months ago
Hey @wasd171,
Thank you for reporting this! The root cause of the problem is that in 0.9.1, the assumption was that if there is an empty batch, we self-correct the count. My assumption was that 50s is enough time for the cluster to be ready, but in this case, it was obviously wrong.
An important thing to point out is that it should self-correct on the next run of the scale set. However, this should be fixed, so again, thank you for bringing this to our attention! Please use the older version of ARC in the meantime. You can also use the older version of the listener, which doesn't propagate the patch ID; it may occasionally create more pods than necessary, but it handles this case appropriately.
@nikola-jokic, I want to check whether this is the same issue I am having.
I have a listener that scheduled 7 jobs to 7 arc-runner pods, but those pods were pending while waiting for a node to come online. After about a minute, all of those pods went away, and then 1-2 minutes later the node was ready for them to run on. The pods didn't come back up, though.
Meanwhile, in GitHub Actions, the jobs were stuck waiting for a runner group. The listener logs in arc-systems show that those jobs were assigned, but the pods they were assigned to had been terminated (probably by the controller), and so the jobs were lost, never to be scheduled.
Does this sound like the same thing, and would rolling back to 0.9.0 address the problem?
Hey @koreyGambill,
I think so. Rolling back to 0.9.0 should fix the problem. I have to mention that in a very unlucky case, 0.9.0 can increase the latency of starting a job. It is a rare case, and if you have a busy cluster, it is very unlikely to happen. However, if you want to be 100% sure everything is executed as quickly as possible in every situation, please roll back to 0.8.3. That version of the controller can create and delete more ephemeral runners than 0.9.0, but it ensures that a runner is created as soon as possible. In the next release (0.9.2), the controller should be able to pick everything up right away and decrease the number of created pods.
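For reference, rolling back is just a matter of pinning both charts to the older version. A minimal sketch, assuming the OCI charts and the arc-systems/arc-runners namespaces from the quickstart docs (release names, githubConfigUrl, and the PAT are placeholders to adjust for your setup):
$ helm upgrade --install arc \
    --namespace arc-systems --create-namespace \
    --version 0.9.0 \
    oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller
$ helm upgrade --install arc-runner-set \
    --namespace arc-runners --create-namespace \
    --version 0.9.0 \
    --set githubConfigUrl="https://github.com/<org>" \
    --set githubConfigSecret.github_token="<PAT>" \
    oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set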
If we could set the waiting time for pods in the gha-runner-scale-set-controller values.yaml, that would be great :)
Hey @ywang-psee,
We don't have a waiting time; this was a bug introduced in 0.9.1 :disappointed: This PR should fix it.
Think we are having the same problem. Also using Karpenter here, and we actually saw the listener apply the appropriate EphemeralRunnerSet settings, then EphemeralRunners started being created, then the actual pods, but they all got killed off very quickly and it was back to 0 before the queued jobs could be picked up.
Have reverted to 0.9.0 and things seem to be working again.
Yep, same here, we reverted to 0.9.0 as well! Keeping an eye on PR #3204.
@nikola-jokic Hi, do you know when the 0.9.2 release will happen?
Hey @chenghuang-mdsol,
I cannot promise anything, but hopefully later this week or next week.
I need to dig into it, but "rolling back" to 0.8.3 results in chart diff errors with regard to the clusterrole and the namespace it's in. The chart behavior must have changed with regard to which namespace accounts are created in. I set the namespaces as in the quickstart docs (arc-runners and arc-systems); do most people just use kube-system? We are still testing, so I can blow it all away, but I'm just wondering.
Also, for removal: will helm uninstall do this cleanly, or is there CRD cleanup required?
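(From what I can tell, helm uninstall removes the releases but, by Helm's design, never the CRDs, so I'd expect a manual cleanup roughly like this; the CRD names are my guess based on the actions.github.com group and kinds in the controller logs:)
$ helm uninstall arc-runner-set -n arc-runners
$ helm uninstall arc -n arc-systems
$ kubectl get crds | grep actions.github.com
$ kubectl delete crd autoscalingrunnersets.actions.github.com \
    autoscalinglisteners.actions.github.com \
    ephemeralrunnersets.actions.github.com \
    ephemeralrunners.actions.github.com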
Hey everyone :wave:
Would anyone be interested in testing out this fix before we release it, please? :pray:
To do it, you can follow these steps:
Thank you in advance! :heart:
Thanks @nikola-jokic, I just tested with the canary image tag and the Helm chart, and it looks like it fixed the issue:
NAME          NAMESPACE    REVISION  UPDATED                               STATUS    CHART                                  APP VERSION
arc           arc-systems  1         2024-05-18 00:57:23.367937 +0900 KST  deployed  gha-runner-scale-set-controller-0.9.1  0.9.1
runner-arm64  arc-runners  1         2024-05-18 00:57:25.176608 +0900 KST  deployed  gha-runner-scale-set-0.9.1             canary-3bda9bb
$ kubectl get pod -n arc-runners
NAME READY STATUS RESTARTS AGE
runner-arm64-k9mpd-runner-j7g9n 2/2 Running 0 82s
It took roughly 70~80 seconds for the node to become ready and for the runner pod to start running, and the controller tolerated the start time.
However, one thing I want to call out is that when I first tried to install this canary image and chart, the listener failed to boot up with the following missing-role error:
2024-05-17T15:44:10Z ERROR AutoscalingRunnerSet Failed to remove finalizers on dependent resources {"version": "canary-3bda9bb", "autoscalingrunnerset": {"name":"runner-arm64","namespace":"arc-runners"}, "error": "failed to patch manager role without finalizer: roles.rbac.authorization.k8s.io \"runner-arm64-gha-rs-manager\" not found"}
github.com/actions/actions-runner-controller/controllers/actions%2egithub%2ecom.(*AutoscalingRunnerSetReconciler).Reconcile
github.com/actions/actions-runner-controller/controllers/actions.github.com/autoscalingrunnerset_controller.go:139
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227
2024-05-17T15:44:10Z ERROR Reconciler error {"controller": "autoscalingrunnerset", "controllerGroup": "actions.github.com", "controllerKind": "AutoscalingRunnerSet", "AutoscalingRunnerSet": {"name":"runner-arm64","namespace":"arc-runners"}, "namespace": "arc-runners", "name": "runner-arm64", "reconcileID": "1f35d8c0-b14f-471b-bbc0-c132a7a5196b", "error": "failed to patch manager role without finalizer: roles.rbac.authorization.k8s.io \"runner-arm64-gha-rs-manager\" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227
2024-05-17T15:44:10Z INFO AutoscalingRunnerSet Deleting resources {"version": "canary-3bda9bb", "autoscalingrunnerset": {"name":"runner-arm64","namespace":"arc-runners"}}
2024-05-17T15:44:10Z INFO AutoscalingRunnerSet Cleaning up the listener {"version": "canary-3bda9bb", "autoscalingrunnerset": {"name":"runner-arm64","namespace":"arc-runners"}}
Maybe it was just a timing issue where the role/rolebinding was created after the controller, or just a one-off blip; I couldn't reproduce it after that. I thought I would share it just in case.
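For anyone hitting the same blip, the quickest way to test the timing theory is probably to check whether the role and binding from the error message actually exist (role name taken from the log above):
$ kubectl get role runner-arm64-gha-rs-manager -n arc-runners
$ kubectl get rolebinding -n arc-runners | grep gha-rs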
Hi @nikola-jokic, I have the same issue in 0.9.3:
2024-07-09T07:38:10.595179313Z stdout F [RUNNER 2024-07-09 07:38:10Z ERR JobDispatcher] Catch exception during renew runner jobrequest 79794.
2024-07-09T07:38:11.246446746Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] System.TimeoutException: The HTTP request timed out after 00:01:00.
2024-07-09T07:38:11.246461895Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] ---> System.Threading.Tasks.TaskCanceledException: A task was canceled.
2024-07-09T07:38:11.246467637Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
2024-07-09T07:38:11.246487934Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at System.Net.Http.HttpConnectionPool.GetHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
2024-07-09T07:38:11.246493242Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
2024-07-09T07:38:11.246497368Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at System.Net.Http.AuthenticationHelper.SendWithAuthAsync(HttpRequestMessage request, Uri authUri, Boolean async, ICredentials credentials, Boolean preAuthenticate, Boolean isProxyAuth, Boolean doRequestAuth, HttpConnectionPool pool, CancellationToken cancellationToken)
2024-07-09T07:38:11.246500393Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
2024-07-09T07:38:11.246503597Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at GitHub.Services.Common.VssHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
2024-07-09T07:38:11.246506861Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] --- End of inner exception stack trace ---
2024-07-09T07:38:11.246509682Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at GitHub.Services.Common.VssHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
2024-07-09T07:38:11.246512302Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at GitHub.Services.Common.VssHttpRetryMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
2024-07-09T07:38:11.246515406Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
2024-07-09T07:38:11.246518675Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at GitHub.Services.WebApi.VssHttpClientBase.SendAsync(HttpRequestMessage message, HttpCompletionOption completionOption, Object userState, CancellationToken cancellationToken)
2024-07-09T07:38:11.246521713Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at GitHub.DistributedTask.WebApi.TaskAgentHttpClient.SendAsync[T](HttpRequestMessage message, Object userState, CancellationToken cancellationToken, Func`3 processResponse)
2024-07-09T07:38:11.246530257Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at GitHub.DistributedTask.WebApi.TaskAgentHttpClient.SendAsync[T](HttpMethod method, IEnumerable`1 additionalHeaders, Guid locationId, Object routeValues, ApiResourceVersion version, HttpContent content, IEnumerable`1 queryParameters, Object userState, CancellationToken cancellationToken, Func`3 processResponse)
2024-07-09T07:38:11.246533658Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at GitHub.Runner.Listener.JobDispatcher.RenewJobRequestAsync(IRunnerServer runnerServer, Int32 poolId, Int64 requestId, Guid lockToken, String orchestrationId, TaskCompletionSource`1 firstJobRequestRenewed, CancellationToken token)
2024-07-09T07:38:11.246537357Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] #####################################################
2024-07-09T07:38:11.246539905Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] System.Threading.Tasks.TaskCanceledException: A task was canceled.
2024-07-09T07:38:11.246542555Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
2024-07-09T07:38:11.246545359Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at System.Net.Http.HttpConnectionPool.GetHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
2024-07-09T07:38:11.246556749Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
2024-07-09T07:38:11.246559539Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at System.Net.Http.AuthenticationHelper.SendWithAuthAsync(HttpRequestMessage request, Uri authUri, Boolean async, ICredentials credentials, Boolean preAuthenticate, Boolean isProxyAuth, Boolean doRequestAuth, HttpConnectionPool pool, CancellationToken cancellationToken)
2024-07-09T07:38:11.246562315Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
2024-07-09T07:38:11.246565222Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at GitHub.Services.Common.VssHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
2024-07-09T07:38:11.246568123Z stdout F [RUNNER 2024-07-09 07:38:11Z INFO JobDispatcher] Retrying lock renewal for jobrequest 79794. Job is valid until 07/09/2024 07:45:56.
2024-07-09T07:38:11.263080643Z stdout F [RUNNER 2024-07-09 07:38:11Z INFO RunnerServer] Refresh JobRequest VssConnection to get on a different AFD node.
2024-07-09T07:38:11.263095261Z stdout F [RUNNER 2024-07-09 07:38:11Z INFO RunnerServer] EstablishVssConnection
2024-07-09T07:38:11.26309838Z stdout F [RUNNER 2024-07-09 07:38:11Z INFO RunnerServer] Establish connection with 30 seconds timeout.
2024-07-09T07:38:11.338495992Z stdout F [RUNNER 2024-07-09 07:38:11Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
{
"$id": "1",
"innerException": null,
"message": "The user 'System:PublicAccess;aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa' is not authorized to access this resource.",
"typeName": "Microsoft.TeamFoundation.Framework.Server.UnauthorizedRequestException, Microsoft.TeamFoundation.Framework.Server",
"typeKey": "UnauthorizedRequestException",
"errorCode": 0,
"eventId": 3000
}
Not sure why. This is on EKS 1.27, and the setup is DinD.
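In case it helps with triage, these are the commands I am using to pull logs; the controller deployment name assumes a release called arc, and the listener pod name comes from listing the pods first:
$ kubectl get pods -n arc-systems
$ kubectl logs -n arc-systems deploy/arc-gha-rs-controller
$ kubectl logs -n arc-systems <scale-set>-<hash>-listener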
Controller Version
0.9.1
Deployment Method
Helm
Describe the bug
We are using Karpenter to dynamically provision worker nodes for our pipelines. It takes around 2 minutes to provision a node and start the runner pod on it. However, the controller appears to scale the EphemeralRunnerSet down after 1 minute, while the pod is still in the Pending status. This leaves the pipeline stuck.
Might be related to https://github.com/actions/actions-runner-controller/issues/3420 and https://github.com/actions/actions-runner-controller/pull/3426
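One way to observe this (assuming the quickstart namespaces) is to watch the EphemeralRunnerSet and its pods side by side while Karpenter provisions the node; the pods sit in Pending and then disappear after roughly a minute:
$ kubectl get ephemeralrunnersets -n arc-runners -w
$ kubectl get pods -n arc-runners -w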
Describe the expected behavior
The controller does not scale the EphemeralRunnerSet down after 1 minute of the runner pod being in the Pending state.