Closed drewjenkel closed 4 years ago
An update: we switched the AKS Kubernetes networking off the Advanced CNI and onto standard K8s networking, which resolved a separate issue. We are monitoring our clusters to see whether this issue returns.
I can confirm this is still happening. I have more verbose logs of the event, and I'm attaching them.
```
2020-03-09 17:53:31.5231|DEBUG|Resolve of path sequence [/system/cluster/core/daemon/joinSeedNodeProcess-2#199240527] failed
2020-03-09 17:53:31.5231|DEBUG|Resolve of path sequence [/system/cluster/core/daemon/joinSeedNodeProcess-2#199240527] failed
2020-03-09 17:53:31.5790|INFO|Cluster Node [akka.tcp://andor@andor-patient-data-0.andor-patient-data:7770] - Welcome from [akka.tcp://andor@lighthouse-0.lighthouse:7770]
2020-03-09 17:53:31.5231|DEBUG|Resolve of path sequence [/system/cluster/core/daemon/joinSeedNodeProcess-2#199240527] failed
```
Looks like that eventually succeeded - but I'll double-check that. Is this a seed node logging this issue, or a joining node?
Right. That's what is odd. It is a joining node, not a seed node.
Here's a little more on it. The pattern goes like this. (Still happening, but I have mitigating controls in place for the moment while I'm still looking into it.)
Node comes online. Node gets the welcome message. Everything is good for a few seconds; then, one at a time, I get a disassociated message followed by a change of node leader, until that has happened with every other node in the cluster. (The other nodes don't see a leader change.)
Eventually every node except itself is "unavailable", and if we look, it's stuck as "Joining". It continues spamming all the other nodes with gossip, which gives the "irrevocably failed, must be restarted" message, yada yada.
I have a node in my environment right now that continues to either do this, or, if it does get to join, it crashes within a couple of minutes - but the same behavior is observed.
Does that make sense?
What's the load looking like in the K8s cluster when this happens?
I have 6 in the node pool.
CPU average around 32%. Other stats are nominal.
Any specific stats help?
FYI, I have one in this state right this minute.
If there are any K8s limits on the number of open connections in your cluster, that would be interesting data to have too. But it looks like it's not a load issue.
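If it does turn out to be transient network or DNS stalls rather than load, one knob worth experimenting with is the cluster failure detector. This is only a sketch - the values below are illustrative, not recommendations (the Akka.NET defaults are `threshold = 8.0` and `acceptable-heartbeat-pause = 3s`):

```hocon
akka.cluster.failure-detector {
  # Illustrative values only: raising these makes the phi-accrual failure
  # detector more tolerant of short network pauses before it marks a node
  # unreachable and kicks off the disassociation / leader-change cascade.
  threshold = 12.0
  acceptable-heartbeat-pause = 6s
}
```

The trade-off is slower detection of genuinely dead nodes, so this is mainly useful as a diagnostic: if the cascade stops, the root cause is likely network hiccups rather than Akka itself.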
I'm looking into whether K8s does something with CPU or connections by default at the moment.
I'm starting to wonder if https://github.com/Azure/AKS/issues/1373 is the underlying issue, and we are just seeing the resulting network and DNS errors. I've seen a number of DNS "service name unknown" exceptions.
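If the DNS errors come from lookups depending on the pod's search domains, one mitigation to try is fully-qualified service names for the seed nodes. A sketch only - the `default` namespace and `cluster.local` zone are assumptions, substitute your own, and `public-hostname` on each node would need the same fully-qualified form so the advertised addresses match:

```hocon
akka.cluster {
  seed-nodes = [
    # "default" namespace and "cluster.local" zone are assumptions here
    "akka.tcp://andor@lighthouse-0.lighthouse.default.svc.cluster.local:7770",
    "akka.tcp://andor@lighthouse-1.lighthouse.default.svc.cluster.local:7770",
    "akka.tcp://andor@lighthouse-2.lighthouse.default.svc.cluster.local:7770"
  ]
}
```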
@drewjenkel should we mark this as closed then?
We can close. I have an active ticket open with Microsoft, and I believe I have this 99% figured out. If it turns out to be an Akka issue, I can reopen, or open a new one and reference this. I don't want to clog your issue tracker.
Thanks! Please let us know.
Environment Details:

Akka.NET versions:
- Akka.Cluster.Tools 1.3.17
- Akka.DI.AutoFac 1.3.9
- Akka.DI.Core 1.3.17
- Akka.Logger.NLog 1.3.4
- Autofac.Extensions.DependencyInjection 5.0.1
- Petabridge.Cmd.Cluster 0.6.3
- Petabridge.Cmd.Remote 0.6.3
- Petabridge.Tracing.ApplicationInsights 0.1.3
Other libraries:
- Microsoft.Extensions.Hosting 2.1.0
.NET Core 2.1.
Deployed via Docker Container for Kubernetes on AKS
PBM embedded into Docker Image
Active Configuration:

```hocon
akka {
  actor {
    provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"
  }
  loggers = ["Akka.Logger.NLog.NLogLogger, Akka.Logger.NLog"]
  remote {
    dot-netty.tcp {
      transport-class = "Akka.Remote.Transport.DotNetty.TcpTransport, Akka.Remote"
      applied-adapters = []
      transport-protocol = tcp
      port = 7770
      hostname = 0.0.0.0
      public-hostname = "lighthouse-0.lighthouse"
      enforce-ip-family = true
      dns-use-ipv6 = false
    }
    log-remote-lifecycle-events = INFO
  }
  cluster {
    seed-nodes = ["akka.tcp://andor@lighthouse-0.lighthouse:7770",
                  "akka.tcp://andor@lighthouse-1.lighthouse:7770",
                  "akka.tcp://andor@lighthouse-2.lighthouse:7770"]
    roles = ["lighthouse"]
    downing-provider-class = "Akka.Cluster.SplitBrainResolver, Akka.Cluster"
    split-brain-resolver {
      active-strategy = keep-referee
      keep-referee {
        address = "akka.tcp://andor@lighthouse-0.lighthouse:7770"
        down-all-if-less-than-nodes = 1
      }
    }
  }
}

petabridge.cmd {
  host = "0.0.0.0"
  port = 9110
  log-palettes-on-startup = on
}
```
Environment Size: 60-70 pods at a given time.
On occasion, I will have a pod enter a state where it sees itself as joining and gets hung. I have attached the logs for the affected service, a Lighthouse node, and the cluster leader.
After a node gets "hung", we randomly see messages on the other nodes that they are receiving gossip intended for a different node.
It appears that the node is downed by the leader, but because the hung node is still "joining" it does not down itself and becomes confused: it never leaves the cluster, and it continues to gossip.
Trying to use PBM (against the affected node), I cannot down, join, exit, or do anything else. The only way to recover has been to delete the pod or scale down the StatefulSet.

akkalogs.zip
I believe this is similar to these issues:
- https://github.com/akkadotnet/akka.net/issues/2584
- https://github.com/akkadotnet/akka.net/issues/3274
- https://github.com/akkadotnet/akka.net/issues/2346