akkadotnet / akka.net

Canonical actor model implementation for .NET with local + distributed actors in C# and F#.
http://getakka.net

Node hung in "joining" status #4290

Closed drewjenkel closed 4 years ago

drewjenkel commented 4 years ago

Environment Details:

Akka.NET versions:
- Akka.Cluster.Tools 1.3.17
- Akka.DI.AutoFac 1.3.9
- Akka.DI.Core 1.3.17
- Akka.Logger.NLog 1.3.4
- Autofac.Extensions.DependencyInjection 5.0.1
- Petabridge.Cmd.Cluster 0.6.3
- Petabridge.Cmd.Remote 0.6.3
- Petabridge.Tracing.ApplicationInsights 0.1.3

Other libraries:
- Microsoft.Extensions.Hosting 2.1.0

.NET Core 2.1.

Deployed via Docker container to Kubernetes on AKS.

Petabridge.Cmd (pbm) embedded in the Docker image.

Active Configuration:

akka {
  actor {
    provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"
  }
  loggers = ["Akka.Logger.NLog.NLogLogger, Akka.Logger.NLog"]
  remote {
    dot-netty.tcp {
      transport-class = "Akka.Remote.Transport.DotNetty.TcpTransport, Akka.Remote"
      applied-adapters = []
      transport-protocol = tcp
      port = 7770
      hostname = 0.0.0.0
      public-hostname = "lighthouse-0.lighthouse"
      enforce-ip-family = true
      dns-use-ipv6 = false
    }
    log-remote-lifecycle-events = INFO
  }
  cluster {
    seed-nodes = [
      "akka.tcp://andor@lighthouse-0.lighthouse:7770",
      "akka.tcp://andor@lighthouse-1.lighthouse:7770",
      "akka.tcp://andor@lighthouse-2.lighthouse:7770"
    ]
    roles = ["lighthouse"]
    downing-provider-class = "Akka.Cluster.SplitBrainResolver, Akka.Cluster"
    split-brain-resolver {
      active-strategy = keep-referee
      keep-referee {
        address = "akka.tcp://andor@lighthouse-0.lighthouse:7770"
        down-all-if-less-than-nodes = 1
      }
    }
  }
}

petabridge.cmd {
  host = "0.0.0.0"
  port = 9110
  log-palettes-on-startup = on
}
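For context, a minimal sketch of how a node would typically parse a HOCON block like this and start the clustered ActorSystem. The system name "andor" comes from the seed-node addresses above; the file name, entry point, and bootstrap shape are illustrative assumptions, not the reporter's actual code.

using System;
using System.IO;
using Akka.Actor;
using Akka.Cluster;
using Akka.Configuration;

class Program
{
    static void Main()
    {
        // Parse the HOCON shown above (the file name "akka.conf" is illustrative).
        var config = ConfigurationFactory.ParseString(File.ReadAllText("akka.conf"));

        // The ActorSystem name must match the one in the seed-node addresses ("andor").
        var system = ActorSystem.Create("andor", config);

        // The cluster extension joins the seed nodes listed under akka.cluster.seed-nodes.
        var cluster = Cluster.Get(system);
        cluster.RegisterOnMemberUp(() => Console.WriteLine("This node is now Up"));

        system.WhenTerminated.Wait();
    }
}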

Environment Size: 60-70 pods at a given time.

On occasion, a pod enters a state where it sees itself as Joining and hangs. I have attached the logs for the affected service, a lighthouse node, and the cluster leader.

After a node becomes hung, we intermittently see messages indicating that it is receiving gossip intended for a different node.

It appears that the node is downed by the leader, but because the hung node still sees itself as Joining, it does not down itself and becomes confused. It never leaves the cluster and continues to gossip.

When I connect to the affected node with pbm, I cannot down, join, leave, or do anything else. The only way to recover has been to delete the pod or scale down the StatefulSet. akkalogs.zip
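For reference, when pbm cannot act on the stuck member, the same Down command can in principle be issued programmatically from a healthy node through the Cluster extension. A minimal sketch, assuming access to that node's running ActorSystem; the member address below is hypothetical:

using Akka.Actor;
using Akka.Cluster;

public static class ManualDowning
{
    // Marks a stuck member as Down from any healthy node in the same cluster,
    // so the leader can prune it from gossip. "system" is the healthy node's
    // running ActorSystem; the address below is a hypothetical pod address.
    public static void DownStuckMember(ActorSystem system)
    {
        var stuck = Address.Parse("akka.tcp://andor@andor-example-0.andor-example:7770");
        Cluster.Get(system).Down(stuck);
    }
}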

I believe this is similar to these issues:

https://github.com/akkadotnet/akka.net/issues/2584
https://github.com/akkadotnet/akka.net/issues/3274
https://github.com/akkadotnet/akka.net/issues/2346

drewjenkel commented 4 years ago

An update: we switched the AKS Kubernetes networking off Azure CNI (advanced networking) and onto standard kubenet networking, which resolved a separate issue. We are monitoring our clusters to see if this issue returns.

drewjenkel commented 4 years ago

I can now confirm this is still happening. I have more verbose logs of the event, which I'm attaching.

drewjenkel commented 4 years ago

cluster_error_verbose.txt

drewjenkel commented 4 years ago

2020-03-09 17:53:31.5231|DEBUG|Resolve of path sequence [/system/cluster/core/daemon/joinSeedNodeProcess-2#199240527] failed
2020-03-09 17:53:31.5231|DEBUG|Resolve of path sequence [/system/cluster/core/daemon/joinSeedNodeProcess-2#199240527] failed
2020-03-09 17:53:31.5790| INFO|Cluster Node [akka.tcp://andor@andor-patient-data-0.andor-patient-data:7770] - Welcome from [akka.tcp://andor@lighthouse-0.lighthouse:7770]

Aaronontheweb commented 4 years ago

2020-03-09 17:53:31.5231|DEBUG|Resolve of path sequence [/system/cluster/core/daemon/joinSeedNodeProcess-2#199240527] failed

Looks like that eventually succeeded - but I'll double-check that. Is this a seed node logging this issue or a joining node?

drewjenkel commented 4 years ago

Right. That's what is odd. It is a joining node, not a seed node.

Here's a little more on it. The pattern goes like this. (It's still happening, but I have mitigating controls in place for the moment while I keep looking into it.)

The node comes online and gets a Welcome message. Everything is good for a few seconds, then, one at a time, I get a Disassociated message followed by a change of node leader, until this has happened with every other node in the cluster. (The other nodes don't see a leader change.)

Eventually every other node is "unreachable" from its point of view, and if we look, the node is stuck as Joining. It keeps spamming all the other nodes with gossip, which produces the "irrevocably failed, must be restarted" messages, and so on.
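As a diagnostic aid for the sequence described above, a small sketch that logs membership, reachability, and leader-change events on a node, so the one-by-one unreachability and the leader churn can be captured with timestamps. The actor and log wording are illustrative, not part of the reporter's setup:

using Akka.Actor;
using Akka.Cluster;
using Akka.Event;

// Subscribes to cluster domain events and logs them, so the "members become
// unreachable one at a time while the leader keeps changing" pattern can be
// traced on the affected node.
public class ClusterEventLogger : ReceiveActor
{
    private readonly ILoggingAdapter _log = Context.GetLogger();

    public ClusterEventLogger()
    {
        Receive<ClusterEvent.MemberUp>(m =>
            _log.Info("Member {0} is Up", m.Member.Address));
        Receive<ClusterEvent.MemberRemoved>(m =>
            _log.Info("Member {0} removed, previous status {1}", m.Member.Address, m.PreviousStatus));
        Receive<ClusterEvent.UnreachableMember>(u =>
            _log.Warning("Member {0} marked unreachable", u.Member.Address));
        Receive<ClusterEvent.ReachableMember>(r =>
            _log.Info("Member {0} reachable again", r.Member.Address));
        Receive<ClusterEvent.LeaderChanged>(l =>
            _log.Info("Leader changed to {0}", l.Leader));
    }

    protected override void PreStart()
    {
        Cluster.Get(Context.System).Subscribe(
            Self,
            ClusterEvent.InitialStateAsEvents,
            typeof(ClusterEvent.IMemberEvent),
            typeof(ClusterEvent.IReachabilityEvent),
            typeof(ClusterEvent.LeaderChanged));
    }

    protected override void PostStop()
    {
        Cluster.Get(Context.System).Unsubscribe(Self);
    }
}

It could be started with, for example, system.ActorOf(Props.Create(() => new ClusterEventLogger()), "cluster-event-logger").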

I have a node in my environment right now that keeps doing this, or, if it does manage to join, crashes within a couple of minutes with the same behavior.

Does that make sense?

Aaronontheweb commented 4 years ago

What's the load looking like in the K8s cluster when this happens?

drewjenkel commented 4 years ago

I have 6 nodes in the node pool.

CPU average around 32%. Other stats are nominal.

Would any specific stats help?

drewjenkel commented 4 years ago

FYI, I have one in this state right this minute.

Aaronontheweb commented 4 years ago

If there are any K8s limits on the number of open connections in your cluster, that would be interesting data to have too. But it looks like it's not a load issue.

drewjenkel commented 4 years ago

I'm looking into whether K8s imposes any default CPU or connection limits at the moment.

drewjenkel commented 4 years ago

I'm starting to wonder whether https://github.com/Azure/AKS/issues/1373 is the underlying issue and we are seeing network and DNS errors. I've seen a number of "DNS service name unknown" exceptions.

Aaronontheweb commented 4 years ago

@drewjenkel should we mark this as closed then?

drewjenkel commented 4 years ago

We can close. I have an active ticket open with Microsoft, and I believe I have this 99% figured out. If it turns out to be an Akka issue, I can reopen this or open a new one and reference it. I don't want to clog up your issue tracker.

Aaronontheweb commented 4 years ago

Thanks! Please let us know.