actions/actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

Runner Scale Set gets stuck, crash loops every 3 seconds: failed to create session: 409 had issue communicating with Actions backend: The runner scale set gha-rs already has an active session for owner gha-rs-7db9c9f7-listener #3351

Closed: jeffmccune closed this issue 3 months ago

jeffmccune commented 3 months ago

Controller Version

0.8.3

Deployment Method

Helm

To Reproduce

1. Install on west coast cluster
2. Install on east coast cluster
3. Create one `AutoscalingRunnerSet` named `gha-rs` in each of the two clusters.
4. Observe that one of the listener pods becomes stuck in a crash loop with the error `Application returned an error: createSession failed: failed to create session: 409 - had issue communicating with Actions backend: The runner scale set gha-rs already has an active session for owner gha-rs-deadbeef-listener.` (A reproduction sketch follows.)
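For concreteness, here is a minimal sketch of the reproduction, assuming the upstream OCI chart location from the ARC docs; the kube contexts `west` and `east` are hypothetical and the chart version is pinned to match the report:

```console
# Install the same scale set chart, with the same release name and
# namespace, into two different clusters.
helm install gha-rs \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --version 0.8.3 --namespace arc-system -f values.yaml --kube-context west

helm install gha-rs \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --version 0.8.3 --namespace arc-system -f values.yaml --kube-context east
```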

Describe the bug

The listener fails to start.

Describe the expected behavior

The listener should start in both clusters.

Additional Context

The values.yaml used is as close to the upstream documentation as possible. The only customization is:

```yaml
controllerServiceAccount:
  name: gha-rs-controller
  namespace: arc-system
githubConfigSecret: controller-manager
githubConfigUrl: https://github.com/myorg
```

Where the `controller-manager` secret contains GitHub App credentials.
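For reference, a sketch of how such a secret might be created, using the key names documented for GitHub App credentials in the ARC docs; the IDs and key file here are placeholders:

```console
# Hypothetical: create the githubConfigSecret referenced in values.yaml.
kubectl create secret generic controller-manager \
  --namespace arc-system \
  --from-literal=github_app_id=123456 \
  --from-literal=github_app_installation_id=654321 \
  --from-file=github_app_private_key=app.private-key.pem
```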

Controller Logs

https://gist.github.com/jeffmccune/e893d4af28727d55979f75fcfddc6536#file-controller-logs-txt

Runner Pod Logs

https://gist.github.com/jeffmccune/e893d4af28727d55979f75fcfddc6536#file-listener-pod-logs-txt
github-actions[bot] commented 3 months ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

jeffmccune commented 3 months ago

This bug may be caused by the hashing method. The pod name is always hashed to the same value regardless of the cluster.

```console
❯ KUBECONFIG=$HOME/.kube/k2 k get pods -n arc-system
NAME                                 READY   STATUS              RESTARTS   AGE
gha-rs-7db9c9f7-listener             0/1     ContainerCreating   0          0s
gha-rs-controller-6897c9bffb-trdc2   1/1     Running             0          2d18h
❯ KUBECONFIG=$HOME/.kube/k3 k get pods -n arc-system
NAME                                 READY   STATUS              RESTARTS   AGE
gha-rs-7db9c9f7-listener             0/1     ContainerCreating   0          1s
gha-rs-controller-6897c9bffb-4jx2m   1/1     Running             0          3d2h
❯ KUBECONFIG=$HOME/.kube/k4 k get pods -n arc-system
NAME                                 READY   STATUS    RESTARTS   AGE
gha-rs-7db9c9f7-listener             1/1     Running   0          25h
gha-rs-controller-6897c9bffb-7gp68   1/1     Running   0          2d19h
```
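That output is consistent with a listener name derived purely from the resource's identity: if the suffix is a hash over inputs such as the scale set's namespace and name, with nothing cluster-specific mixed in, every cluster computes the same pod name. A stand-in illustration, using sha256 only as a placeholder for whatever hash function ARC actually uses:

```console
# Same input on every cluster => same digest => same listener name.
echo -n "arc-system/gha-rs" | sha256sum | cut -c1-8
```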
jeffmccune commented 3 months ago

The bug looks to be here.

A workaround is to use a different controller namespace in each cluster, but unfortunately this isn't a permanent solution because it doesn't square with the position of SIG Multicluster, which expects a namespace of a given name to mean the same thing in every cluster (namespace sameness).
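A sketch of that workaround under the same assumptions as the reproduction sketch above: vary the install namespace per cluster so that, assuming the namespace feeds the hash, each cluster derives a distinct listener name. The namespace names are placeholders:

```console
# Hypothetical workaround: per-cluster namespaces change the hash input,
# so each cluster's listener gets a distinct session owner name.
helm install gha-rs \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --version 0.8.3 --namespace arc-system-west -f values.yaml --kube-context west

helm install gha-rs \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --version 0.8.3 --namespace arc-system-east -f values.yaml --kube-context east
```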

nikola-jokic commented 3 months ago

Hey @jeffmccune,

Since your scale sets are named the same, do they belong to different runner groups? If not, that is the problem.

jeffmccune commented 3 months ago

Hi @nikola-jokic, thanks for following up.

No, we need at least two clusters in the same group for multi-region redundancy. Why is that a problem? There is one scale set spanning N>1 regions.

jeffmccune commented 3 months ago

What I'm trying to accomplish is to give dev teams a default runs-on target that runs on any available cluster in any available region. We regularly take down entire clusters, so having a dev team target a specific region or cluster isn't ideal: their workflows would fail while that cluster is down, even though other clusters are available to run them.

What's the recommended way to have a workflow target any available scale set in any available cluster (region)?

nikola-jokic commented 3 months ago

Oh, you can't have two scale sets with the same name belonging to the same runner group. That is the cause of this report: the scale set already has an active session open. I hope this document can help you.

jeffmccune commented 3 months ago

Previously it was straightforward to spin up N>1 clusters of self-hosted runners sharing the `self-hosted` label, and jobs would execute on any of them. There was no need for coordination between teams along the lines of "cluster X is going down for maintenance, update your workflows."

How can a similar level of availability be achieved? Is there a way to configure a workflow to run on any one of X, Y, or Z runner scale sets?

nikola-jokic commented 3 months ago

Sorry if I misunderstood your question, but I think you are asking two things:

  1. You want to deploy the same scale set to, say, 3 clusters, and have workflows pick whichever is available. If that is the case, name each scale set the same but put each one in a different runner group. The scale set that is up and fastest to acquire the job will run it, so even if you take one cluster down, the remaining two keep taking jobs (see the workflow sketch after this list).
  2. You want a family of scale sets, for example scale-set-1 and scale-set-2, and a workflow that can run on either scale-set-1 or scale-set-2. That is not currently possible.
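A sketch of what option 1 looks like from the workflow side, assuming the scale set is named `gha-rs` in every cluster as above: with scale sets, `runs-on` takes the scale set name as its single label, so whichever cluster's listener acquires the job runs it:

```yaml
# Hypothetical workflow snippet; "gha-rs" is the scale set name used above.
jobs:
  build:
    runs-on: gha-rs  # single label: the scale set name, shared across runner groups
    steps:
      - run: echo "runs on whichever cluster acquired the job first"
```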
nikola-jokic commented 3 months ago

Closing this one as answered, but feel free to comment on it :relaxed:

jeffmccune commented 3 months ago

Thanks for taking the time to answer this. You're correct about my two questions. My only comment is that I'm a bit frustrated: with scale sets, runner groups are required to achieve the behavior that labels previously provided, yet groups are a paid enterprise feature and labels are not, so this feels like a step backwards.

I'm a paying customer, but some of the GitHub orgs I work with are not, so they cannot use groups and as a result cannot deploy highly available scale sets. Please consider adding back some mechanism for a workflow to execute on any available cluster without requiring a paid feature, as was possible previously. Thanks again for taking the time to respond.

nikola-jokic commented 3 months ago

No problem, thank you for your feedback! Would you be so kind as to post it in the discussion here? People there are expressing their thoughts on our single-label approach, and your feedback would be valuable. Thanks :relaxed:!