Closed: wasd171 closed this 4 months ago
Hey @wasd171,
Thank you for reporting this! The root cause of the problem is that in 0.9.1, the assumption was that if there is an empty batch, we self-correct the count. My assumption was that 50s is enough time for the cluster to be ready, but in this case, it was obviously wrong.
An important thing to point out is that it should self-correct on the next run of the scale set. However, this should be fixed, so again, thank you for bringing this to our attention! Please use the older version of ARC in the meantime. You can also use the older version of the listener, which doesn't propagate the patch ID; it may occasionally create more pods than necessary, but it handles this case appropriately.
@nikola-jokic, I want to check whether this is the same issue I am having.
I have a listener that scheduled 7 jobs to 7 arc-runner pods, but those pods were pending while waiting for a node to come online. After about a minute, all of those pods went away, and then 1-2 minutes later the node was ready for them to run on. The pods didn't come back up, though.
Meanwhile, in GitHub Actions, the jobs were stuck waiting for a runner group. The listener logs in arc-systems show that those jobs were assigned, but the pods they were assigned to had been terminated (probably by the controller), and so the jobs were lost, never to be scheduled.
Does this sound like the same thing, and would rolling back to 0.9.0 address the problem?
Hey @koreyGambill,
I think so. Rolling back to 0.9.0 should fix the problem. I have to mention that in a very unlucky case, 0.9.0 can increase the latency of starting a job. It is a rare case, and if you have a busy cluster, it is very unlikely to happen. However, if you want to be 100% sure everything is executed as quickly as possible in every situation, please roll back to 0.8.3. That version of the controller can create and delete more ephemeral runners than 0.9.0, but it ensures that a runner is created as soon as possible. In the next release (0.9.2), the controller should be able to pick everything up right away and decrease the number of created pods.
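For reference, rolling back is just a matter of pinning both charts to the older version. A minimal sketch, assuming the OCI charts and the arc-systems/arc-runners namespaces from the quickstart docs (release names, githubConfigUrl, and the PAT are placeholders to adjust for your setup):
$ helm upgrade --install arc \
    --namespace arc-systems --create-namespace \
    --version 0.9.0 \
    oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller
$ helm upgrade --install arc-runner-set \
    --namespace arc-runners --create-namespace \
    --version 0.9.0 \
    --set githubConfigUrl="https://github.com/<org>" \
    --set githubConfigSecret.github_token="<PAT>" \
    oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set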
If we could set the waiting time for pods in the gha-runner-scale-set-controller values.yaml, that would be great :)
Hey @ywang-psee,
We don't have a waiting time; this was a bug introduced in 0.9.1 :disappointed: This PR should fix it.
Think we are having the same problem. Also using Karpenter here, and we actually saw the listener apply the appropriate EphemeralRunnerSet settings, then EphemeralRunners started being created, then the actual pods, but they all got killed off very quickly and it was back to 0 before the queued jobs could be picked up.
Have reverted to 0.9.0 and things seem to be working again.
Yep, same here, we reverted to 0.9.0 as well! Keeping an eye on PR #3204.
@nikola-jokic Hi, do you know when the 0.9.2 release will happen?
Hey @chenghuang-mdsol,
I cannot promise anything, but hopefully later this week or next week.
I need to dig into it, but "rolling back" to 0.8.3 results in chart diff errors with regard to the clusterrole and the namespace it's in. The chart behavior must have changed with regard to which namespace accounts are created in. I set the namespaces as in the quickstart docs (arc-runners and arc-systems); do most people just use kube-system? We are still testing, so I can blow it all away, but I'm just wondering.
Also, for removal: will helm uninstall do this cleanly, or is there CRD cleanup required?
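(From what I can tell, helm uninstall removes the releases but, by Helm's design, never the CRDs, so I'd expect a manual cleanup roughly like this; the CRD names are my guess based on the actions.github.com group and kinds in the controller logs:)
$ helm uninstall arc-runner-set -n arc-runners
$ helm uninstall arc -n arc-systems
$ kubectl get crds | grep actions.github.com
$ kubectl delete crd autoscalingrunnersets.actions.github.com \
    autoscalinglisteners.actions.github.com \
    ephemeralrunnersets.actions.github.com \
    ephemeralrunners.actions.github.com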
Hey everyone :wave:
Would anyone be interested in testing out this fix before we release it, please? :pray:
To do it, you can follow these steps:
Thank you in advance! :heart:
Thanks @nikola-jokic, I just tested with the canary image tag and the Helm chart, and it looks like it fixed the issue:
NAME          NAMESPACE    REVISION  UPDATED                               STATUS    CHART                                  APP VERSION
arc           arc-systems  1         2024-05-18 00:57:23.367937 +0900 KST  deployed  gha-runner-scale-set-controller-0.9.1  0.9.1
runner-arm64  arc-runners  1         2024-05-18 00:57:25.176608 +0900 KST  deployed  gha-runner-scale-set-0.9.1             canary-3bda9bb
$ kubectl get pod -n arc-runners
NAME READY STATUS RESTARTS AGE
runner-arm64-k9mpd-runner-j7g9n 2/2 Running 0 82s
It took roughly 70~80 seconds for the node to become ready and for the runner pod to start running, and the controller tolerated the start time.
However, one thing I want to call out is that when I first tried to install this canary image and chart, the listener failed to boot up with the following missing-role error:
2024-05-17T15:44:10Z ERROR AutoscalingRunnerSet Failed to remove finalizers on dependent resources {"version": "canary-3bda9bb", "autoscalingrunnerset": {"name":"runner-arm64","namespace":"arc-runners"}, "error": "failed to patch manager role without finalizer: roles.rbac.authorization.k8s.io \"runner-arm64-gha-rs-manager\" not found"}
github.com/actions/actions-runner-controller/controllers/actions%2egithub%2ecom.(*AutoscalingRunnerSetReconciler).Reconcile
github.com/actions/actions-runner-controller/controllers/actions.github.com/autoscalingrunnerset_controller.go:139
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227
2024-05-17T15:44:10Z ERROR Reconciler error {"controller": "autoscalingrunnerset", "controllerGroup": "actions.github.com", "controllerKind": "AutoscalingRunnerSet", "AutoscalingRunnerSet": {"name":"runner-arm64","namespace":"arc-runners"}, "namespace": "arc-runners", "name": "runner-arm64", "reconcileID": "1f35d8c0-b14f-471b-bbc0-c132a7a5196b", "error": "failed to patch manager role without finalizer: roles.rbac.authorization.k8s.io \"runner-arm64-gha-rs-manager\" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227
2024-05-17T15:44:10Z INFO AutoscalingRunnerSet Deleting resources {"version": "canary-3bda9bb", "autoscalingrunnerset": {"name":"runner-arm64","namespace":"arc-runners"}}
2024-05-17T15:44:10Z INFO AutoscalingRunnerSet Cleaning up the listener {"version": "canary-3bda9bb", "autoscalingrunnerset": {"name":"runner-arm64","namespace":"arc-runners"}}
Maybe it was just a timing issue where the role/rolebinding was created after the controller, or just a one-off blip; I couldn't reproduce it after that. I thought I would share it just in case.
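For anyone hitting the same blip, the quickest way to test the timing theory is probably to check whether the role and binding from the error message actually exist (role name taken from the log above):
$ kubectl get role runner-arm64-gha-rs-manager -n arc-runners
$ kubectl get rolebinding -n arc-runners | grep gha-rs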
Hi @nikola-jokic, I have the same issue in 0.9.3:
2024-07-09T07:38:10.595179313Z stdout F [RUNNER 2024-07-09 07:38:10Z ERR JobDispatcher] Catch exception during renew runner jobrequest 79794.
2024-07-09T07:38:11.246446746Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] System.TimeoutException: The HTTP request timed out after 00:01:00.
2024-07-09T07:38:11.246461895Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] ---> System.Threading.Tasks.TaskCanceledException: A task was canceled.
2024-07-09T07:38:11.246467637Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
2024-07-09T07:38:11.246487934Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at System.Net.Http.HttpConnectionPool.GetHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
2024-07-09T07:38:11.246493242Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
2024-07-09T07:38:11.246497368Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at System.Net.Http.AuthenticationHelper.SendWithAuthAsync(HttpRequestMessage request, Uri authUri, Boolean async, ICredentials credentials, Boolean preAuthenticate, Boolean isProxyAuth, Boolean doRequestAuth, HttpConnectionPool pool, CancellationToken cancellationToken)
2024-07-09T07:38:11.246500393Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
2024-07-09T07:38:11.246503597Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at GitHub.Services.Common.VssHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
2024-07-09T07:38:11.246506861Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] --- End of inner exception stack trace ---
2024-07-09T07:38:11.246509682Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at GitHub.Services.Common.VssHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
2024-07-09T07:38:11.246512302Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at GitHub.Services.Common.VssHttpRetryMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
2024-07-09T07:38:11.246515406Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
2024-07-09T07:38:11.246518675Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at GitHub.Services.WebApi.VssHttpClientBase.SendAsync(HttpRequestMessage message, HttpCompletionOption completionOption, Object userState, CancellationToken cancellationToken)
2024-07-09T07:38:11.246521713Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at GitHub.DistributedTask.WebApi.TaskAgentHttpClient.SendAsync[T](HttpRequestMessage message, Object userState, CancellationToken cancellationToken, Func`3 processResponse)
2024-07-09T07:38:11.246530257Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at GitHub.DistributedTask.WebApi.TaskAgentHttpClient.SendAsync[T](HttpMethod method, IEnumerable`1 additionalHeaders, Guid locationId, Object routeValues, ApiResourceVersion version, HttpContent content, IEnumerable`1 queryParameters, Object userState, CancellationToken cancellationToken, Func`3 processResponse)
2024-07-09T07:38:11.246533658Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at GitHub.Runner.Listener.JobDispatcher.RenewJobRequestAsync(IRunnerServer runnerServer, Int32 poolId, Int64 requestId, Guid lockToken, String orchestrationId, TaskCompletionSource`1 firstJobRequestRenewed, CancellationToken token)
2024-07-09T07:38:11.246537357Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] #####################################################
2024-07-09T07:38:11.246539905Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] System.Threading.Tasks.TaskCanceledException: A task was canceled.
2024-07-09T07:38:11.246542555Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
2024-07-09T07:38:11.246545359Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at System.Net.Http.HttpConnectionPool.GetHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
2024-07-09T07:38:11.246556749Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
2024-07-09T07:38:11.246559539Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at System.Net.Http.AuthenticationHelper.SendWithAuthAsync(HttpRequestMessage request, Uri authUri, Boolean async, ICredentials credentials, Boolean preAuthenticate, Boolean isProxyAuth, Boolean doRequestAuth, HttpConnectionPool pool, CancellationToken cancellationToken)
2024-07-09T07:38:11.246562315Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
2024-07-09T07:38:11.246565222Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR JobDispatcher] at GitHub.Services.Common.VssHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
2024-07-09T07:38:11.246568123Z stdout F [RUNNER 2024-07-09 07:38:11Z INFO JobDispatcher] Retrying lock renewal for jobrequest 79794. Job is valid until 07/09/2024 07:45:56.
2024-07-09T07:38:11.263080643Z stdout F [RUNNER 2024-07-09 07:38:11Z INFO RunnerServer] Refresh JobRequest VssConnection to get on a different AFD node.
2024-07-09T07:38:11.263095261Z stdout F [RUNNER 2024-07-09 07:38:11Z INFO RunnerServer] EstablishVssConnection
2024-07-09T07:38:11.26309838Z stdout F [RUNNER 2024-07-09 07:38:11Z INFO RunnerServer] Establish connection with 30 seconds timeout.
2024-07-09T07:38:11.338495992Z stdout F [RUNNER 2024-07-09 07:38:11Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
{
"$id": "1",
"innerException": null,
"message": "The user 'System:PublicAccess;aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa' is not authorized to access this resource.",
"typeName": "Microsoft.TeamFoundation.Framework.Server.UnauthorizedRequestException, Microsoft.TeamFoundation.Framework.Server",
"typeKey": "UnauthorizedRequestException",
"errorCode": 0,
"eventId": 3000
}
Not sure why. This is on EKS 1.27, and the setup is DinD.
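In case it helps with triage, these are the commands I am using to pull logs; the controller deployment name assumes a release called arc, and the listener pod name comes from listing the pods first:
$ kubectl get pods -n arc-systems
$ kubectl logs -n arc-systems deploy/arc-gha-rs-controller
$ kubectl logs -n arc-systems <scale-set>-<hash>-listener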
Controller Version
0.9.1
Deployment Method
Helm
Describe the bug
We are using Karpenter to dynamically provision worker nodes for our pipelines. It takes around 2 minutes to provision a node and start the runner pod on it. However, the controller appears to scale the EphemeralRunnerSet down after 1 minute, while the pod is still in the Pending status. This leaves the pipeline stuck.
Might be related to https://github.com/actions/actions-runner-controller/issues/3420 and https://github.com/actions/actions-runner-controller/pull/3426
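One way to observe this (assuming the quickstart namespaces) is to watch the EphemeralRunnerSet and its pods side by side while Karpenter provisions the node; the pods sit in Pending and then disappear after roughly a minute:
$ kubectl get ephemeralrunnersets -n arc-runners -w
$ kubectl get pods -n arc-runners -w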
Describe the expected behavior
The controller does not scale the EphemeralRunnerSet down after 1 minute of the runner pod being in the Pending state.