Closed by jeffmccune 3 months ago
Hello! Thank you for filing an issue.
The maintainers will triage your issue shortly.
In the meantime, please take a look at the troubleshooting guide for bug reports.
If this is a feature request, please review our contribution guidelines.
This bug may be caused by the hashing method: the listener pod name is hashed from inputs that are identical in every cluster, so the same pod name is generated regardless of the cluster.
```
❯ KUBECONFIG=$HOME/.kube/k2 k get pods -n arc-system
NAME                                 READY   STATUS              RESTARTS   AGE
gha-rs-7db9c9f7-listener             0/1     ContainerCreating   0          0s
gha-rs-controller-6897c9bffb-trdc2   1/1     Running             0          2d18h

❯ KUBECONFIG=$HOME/.kube/k3 k get pods -n arc-system
NAME                                 READY   STATUS              RESTARTS   AGE
gha-rs-7db9c9f7-listener             0/1     ContainerCreating   0          1s
gha-rs-controller-6897c9bffb-4jx2m   1/1     Running             0          3d2h

❯ KUBECONFIG=$HOME/.kube/k4 k get pods -n arc-system
NAME                                 READY   STATUS    RESTARTS   AGE
gha-rs-7db9c9f7-listener             1/1     Running   0          25h
gha-rs-controller-6897c9bffb-7gp68   1/1     Running   0          2d19h
```
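The identical `gha-rs-7db9c9f7-listener` name in all three clusters can be sketched as follows. This is a minimal illustration of the suspected failure mode, not ARC's actual implementation: the hash function, suffix length, and helper name here are assumptions. If the name is derived only from the scale-set name, every cluster computes the identical pod name.

```python
import hashlib

def listener_pod_name(scale_set_name: str) -> str:
    """Hypothetical naming scheme: the suffix is derived only from the
    scale-set name, with no per-cluster input mixed in."""
    digest = hashlib.sha256(scale_set_name.encode()).hexdigest()[:8]
    return f"{scale_set_name}-{digest}-listener"

# The same scale-set name in clusters k2, k3, and k4 yields the same
# listener pod name in every cluster:
print(listener_pod_name("gha-rs"))
print(listener_pod_name("gha-rs"))
```

Because no cluster-specific value participates in the hash, the collision is deterministic rather than coincidental.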
Bug looks to be here.
The workaround is to use a different controller namespace in each cluster, but unfortunately this isn't a permanent solution because it conflicts with the position of SIG Multicluster.
Hey @jeffmccune,
Since your scale sets are named the same, do they belong to different runner groups? If not, that is the problem.
Hi @nikola-jokic thanks for following up.
No, we need to have at least two clusters in the same group for multi-region redundancy. Why is it a problem? There is one scale set spanning N>1 regions.
What I'm trying to accomplish is to allow dev teams to have a default runs-on target that runs on any available cluster in any available region. We regularly take down entire clusters, so having a dev team target a specific region or cluster isn't ideal, their workflows would fail when we take the cluster down even though other clusters are available to run the workflow.
What's the recommended way to have a workflow target any available scale set in any available cluster (region)?
Oh, you can't have two scale sets with the same name belonging to the same runner group. That is the cause of this report: the scale set already has a session open. I hope this document can help you.
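For reference, the runner-group separation can be expressed in each cluster's scale-set Helm values. The group name and config URL below are placeholders, not values from this report:

```yaml
# Scale-set values for one cluster (names are illustrative).
githubConfigUrl: https://github.com/my-org   # placeholder org
runnerScaleSetName: gha-rs                   # same runs-on target everywhere
runnerGroup: runners-cluster-k2              # distinct runner group per cluster
```

Giving each cluster its own `runnerGroup` avoids the session conflict while keeping the `runnerScaleSetName` identical.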
Previously it was straightforward to spin up N>1 clusters of self-hosted runners with the "self-hosted" label, and jobs would execute globally. There was no unnecessary coordination between teams along the lines of "cluster X is going down for maintenance, update your workflows."
How can this similar level of availability be achieved? Is there a way to configure the workflow to run on any one of X, Y, or Z runner sets?
Sorry if I misunderstood your question, but I think these are the two points you are asking about:
Closing this one as answered, but feel free to comment on it :relaxed:
Thanks for taking the time to answer this. You're correct about my two questions. The only comment I have is that I'm a bit frustrated because with scale sets, groups are required to achieve the same behavior that was previously supported with labels. My frustration is that groups are a paid enterprise feature but labels are not, so this feels like a step backwards.
I'm a paying customer, but some of the github orgs I work with are not, so they cannot use groups and as a result cannot deploy highly available scale sets. Please consider adding back some mechanism to have a workflow execute on any available cluster without requiring a paid feature like was possible previously. Thanks again for your time responding to the question.
No problem, thank you for your feedback! Would you be so kind as to put it in the discussion here? In this discussion, people are expressing their thoughts on our single-label approach, and your feedback would be valuable. Thanks :relaxed:!
Checks
Controller Version
0.8.3
Deployment Method
Helm
To Reproduce
Describe the bug
The listener fails to start.
Describe the expected behavior
The listener should start in both clusters.
Additional Context
Controller Logs
Runner Pod Logs