Seed node couldn't join another seed nodes

carlosrpereira commented 6 years ago

We are facing a situation where a node is unable to join the cluster. We have two seed nodes, 10.0.0.4 and 10.0.0.5. Node 10.0.0.4 starts and works normally, but node 10.0.0.5 does not join the cluster. Both servers are on the same LAN.

Log from node 10.0.0.5:

2018-07-03 22:53:09.920 +00:00 [Warning][akka://Cluster/system/cluster/core/daemon/joinSeedNodeProcess-1][#0024] Couldn't join seed nodes after [727] attempts, will try again. seed-nodes=["akka.tcp://Cluster@10.0.0.4:10000"]
2018-07-03 22:53:14.931 +00:00 [Warning][akka://Cluster/system/cluster/core/daemon/joinSeedNodeProcess-1][#0024] Couldn't join seed nodes after [728] attempts, will try again. seed-nodes=["akka.tcp://Cluster@10.0.0.4:10000"]
...
2018-07-03 22:56:45.350 +00:00 [Warning][akka://Cluster/system/cluster/core/daemon/joinSeedNodeProcess-1][#0017] Couldn't join seed nodes after [770] attempts, will try again. seed-nodes=["akka.tcp://Cluster@10.0.0.4:10000"]

Network connectivity is ok:

@ip-10-0-0-5:~$ ping 10.0.0.4
PING 10.0.0.4 (10.0.0.4) 56(84) bytes of data.
64 bytes from 10.0.0.4: icmp_seq=1 ttl=64 time=0.601 ms
64 bytes from 10.0.0.4: icmp_seq=2 ttl=64 time=0.589 ms
64 bytes from 10.0.0.4: icmp_seq=3 ttl=64 time=0.581 ms

Testing port 10000 ...

ip-10-0-0-5:~$ telnet 10.0.0.4 10000
Trying 10.0.0.4...
Connected to 10.0.0.4.

Seed 10.0.0.4 is active and working with the other nodes:

[10.0.0.4:10150] pbm> cluster show
akka.tcp://Cluster@10.0.0.4:10000 | [seed] | up |
akka.tcp://Cluster@10.0.10.202:10050 | [api] | up |
akka.tcp://Cluster@10.0.10.202:10051 | [api] | up |
akka.tcp://Cluster@10.0.10.202:10052 | [backend] | up |
akka.tcp://Cluster@10.0.132.195:10050 | [services] | up |
akka.tcp://Cluster@10.0.135.181:10050 | [services] | up |
akka.tcp://Cluster@10.0.138.201:10050 | [services] | up |
akka.tcp://Cluster@10.0.143.124:10050 | [services] | up |
akka.tcp://Cluster@10.0.165.231:10050 | [services] | up |
akka.tcp://Cluster@10.0.181.190:10050 | [services] | up |
akka.tcp://Cluster@10.0.183.155:10050 | [services] | up |
akka.tcp://Cluster@10.0.183.63:10050 | [services] | up |
akka.tcp://Cluster@10.0.184.21:10050 | [services] | up |
akka.tcp://Cluster@10.0.185.183:10050 | [services] | up |
Count: 14 nodes

Akka configuration:

akka : {
   stdout-loglevel : DEBUG
   loglevel : DEBUG
   log-config-on-start : on
   actor : {
     debug : {
       receive : on
       autoreceive : on
       lifecycle : on
       event-stream : on
       unhandled : on
     }
   }
   remote : {
     log-received-messages : on
     log-sent-messages : on
     gremlin : {
       debug : on
     }
     dot-netty : {
       tcp : {
         hostname : 10.0.0.5
         port : 10000
       }
     }
   }
   cluster : {
     allow-weakly-up-members : on
     seed-nodes : ["akka.tcp://Cluster@10.0.0.4:10000","akka.tcp://Cluster@10.0.0.5:10000"]
     roles : [seed]
     debug : {
       verbose-receive-gossip-logging : on
     }
   }
 }

Versioning

Akka.NET version: 1.3.8
Linux + Docker + .NET Core 2.1

Both 10.0.0.4 and 10.0.0.5 are running exactly the same Docker container version.

Sometimes, after rebooting both machines, the nodes are able to join the cluster, but after a while the message starts appearing again.

jonhoare commented 6 years ago

Is this still an issue for you?

We have almost identical issues running our system in Docker. Our setup is:

1x Seed Node (Lighthouse)
2x App1 Applications (identical Docker images/versions)
2x App2 Applications (identical Docker images/versions)
2x App3 Applications (identical Docker images/versions)

Sometimes all apps connect and join the cluster correctly using the Lighthouse seed node. At other times, however, we end up with, for example, 2x App1, 2x App2, and only 1x App3 joined to the cluster, with no other useful information.

We have tried scaling back to just one instance each of App1, App2, and App3. Sometimes all of them connect fine, and other times only App1 and App3 join while App2 fails (it could be any of the apps that fails).

When we get this issue, I have checked that I can ping and telnet to my seed node from the failed app container, and that I can also ping and telnet back to the failed app container from my seed node. So it doesn't appear to be a networking issue.

Horusiath commented 6 years ago

@jonhoare are you starting all of these nodes at the same time? If so, maybe just give the seed node a head start and let it come up before the other nodes try to join it?

jonhoare commented 6 years ago

@Horusiath Thanks for the heads up.

We have had our lighthouse application up and running for around 45 minutes now, and have just recycled the pods (we are using kubernetes) that were not connected. Some of these pods were able to connect successfully, but a couple of others still are not.

We get the following warning message in our application pod that is failing to join the Lighthouse seed node.

[15:03:43 WRN] Couldn't join seed nodes after [760] attempts, will try again. seed-nodes=[akka.tcp://myapp@myapp-lighthouse-0.myapp-lighthouse:4053]

In Lighthouse itself, we don't see any warnings about a node failing to join or communicate.

Again, networking-wise, both pods can reach each other, both by IP and port and by DNS name and port.

Is there anything else you can suggest trying, other than restarting a pod when it fails to communicate with the cluster?

theladyjaye commented 6 years ago

I have the same issue when running inside docker. (Akka v1.3.9)

In my simple setup, if I run 1x seed node, my other services seem to connect just fine. When I introduce a second seed node, however, things become very unstable. As @carlosrpereira did, I can both ping AND telnet from Seed #2 to Seed #1, but Seed #2 will not connect to Seed #1:

[WARNING][09/28/2018 16:50:26][Thread 0010][[akka://Attraction/system/cluster/core/daemon/joinSeedNodeProcess-1#1499972846]] Couldn't join seed nodes after [191] attempts, will try again. seed-nodes=[akka.tcp://MySystem@172.23.0.2:4053]

So, again, pinging 172.23.0.2 and telnetting to 172.23.0.2:4053 both work from Seed #2, but it refuses to join.

If I run the same processes not in docker, everything seems to work just fine 100% of the time.

I don't know if this helps, but I have an observation:

If I set the hostname to 0.0.0.0 inside the Docker container and set the public-hostname to my host machine's IP, things seem to work much faster, and the nodes at least connect for the most part. I need to explicitly set the cluster port for the Akka services for this to work, but I'll take working over not working any day =)
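
For anyone wanting to try the same thing, here is a minimal sketch of that binding, assuming the `hostname`, `public-hostname`, and `port` settings of the Akka.NET dot-netty.tcp transport. The address 192.168.1.10 and the port 4053 are placeholders for illustration, not values taken from my actual setup:

```hocon
akka : {
  remote : {
    dot-netty : {
      tcp : {
        # Address the socket binds to inside the container (all interfaces).
        hostname : "0.0.0.0"
        # Address advertised to other nodes; placeholder for the host machine's IP.
        public-hostname : "192.168.1.10"
        # Fixed port so it can be published from the container; placeholder value.
        port : 4053
      }
    }
  }
}
```

With a binding like this, the chosen port also has to be published from the container so that the advertised address is actually reachable from the other nodes.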

theladyjaye commented 5 years ago

Following up to see if anyone has gotten Akka.NET to run successfully inside Docker. If I run a demo setup outside of Docker (1 Lighthouse, 3 services), everything works great. When I run them all inside a single container, bound to localhost inside the container (cluster seed node akka.tcp://Cluster@localhost:4053), some services join the cluster and others do not.

Under this setup I can telnet to localhost:4053 and connect, but I get messages from the services saying they cannot reach the seed node. Again, all the processes are running inside the same container, so I would think there would not be any network issues.

Ahhhh maybe this is it then:

https://github.com/akkadotnet/akka.net/issues/3506

I am on .NET Core 2.1, and that seems to be related. My issue only happens in Docker, though. I am running on the microsoft/dotnet image.

jonhoare commented 5 years ago

@aventurella I do have this working in Docker as mentioned above, however we do notice that our apps occasionally fail to connect to lighthouse or they disconnect and are unable to reconnect again.

We believe this is caused by an open bug in the DotNetty package used by Akka.NET, which shows up when running Akka.NET on .NET Core 2.1+.

https://github.com/akkadotnet/akka.net/issues/3506

When we dropped our .NET Core version to 2.0, the issue appeared to go away, and we are waiting for a fix that will allow us to move back to 2.1+.

Not sure if this helps for you, but could be worth checking/trying?

theladyjaye commented 5 years ago

Oh definitely, I'll gladly downgrade to give that a try. Where did you grab the .NET Core 2.0 image from? I'm looking here:

https://github.com/dotnet/dotnet-docker
https://hub.docker.com/r/microsoft/dotnet/

But only see 2.1

theladyjaye commented 5 years ago

I can just make my own image:

https://www.microsoft.com/net/download/linux-package-manager/debian9/runtime-2.0.9

jonhoare commented 5 years ago

I mean I am still using the Petabridge Lighthouse Docker image, but I had to ensure my Akka projects were only using .NET Core 2.0 packages, and my Docker image is based on microsoft/dotnet:2.0.9-runtime.

theladyjaye commented 5 years ago

Adding on to this, I seem to only experience the issue when running inside Docker. For example, I am successfully running:

1 Seed node
4 Services

bound to localhost, not inside Docker, on CentOS 7 with .NET Core 2.1, without an issue yet.

Aaronontheweb commented 5 years ago

This is a duplicate of #3506 - it's the same issue. Network stack on .NET Core 2.1 introduced a regression on Linux. Working with the DotNetty team to get to the bottom of it.

Aaronontheweb commented 5 years ago

> bound to localhost, not inside docker on CentOS 7, NET Core 2.1 without an issue yet.

That's an interesting piece of information... Quick question: does your CentOS environment default to IPv4 or IPv6? A possibility that @stormhub raised in the DotNetty chat room is that the issue could be related to DNS resolution on IPv6 not being fully supported on Linux. We've run into that issue before with Mono, and even with bare-metal work on Linux distributions.

Unless I'm reading https://docs.docker.com/config/daemon/ipv6/ incorrectly, it looks to me like Docker defaults to IPv4, which should be fine...
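
If you want to rule the IPv6 theory out, a sketch like the following could pin the transport to IPv4. It assumes the `dns-use-ipv6` and `enforce-ip-family` settings from the Akka.NET dot-netty.tcp reference configuration are available in the version you are running, so check the reference configuration for your version before relying on it:

```hocon
akka : {
  remote : {
    dot-netty : {
      tcp : {
        # Do not use IPv6 addresses when resolving host names.
        dns-use-ipv6 : off
        # Enforce a single IP family on DNS resolution (IPv4 here, since dns-use-ipv6 is off).
        enforce-ip-family : on
      }
    }
  }
}
```

If the join failures persist with IPv4 enforced, DNS resolution over IPv6 is unlikely to be the cause.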

sean-gilliam commented 5 years ago

According to #3506, this has been resolved via #3633

If you are still experiencing an issue, please upgrade to the latest version of Akka.NET.

akkadotnet / akka.net

Seed node couldn't join another seed nodes #3534