Closed carlosrpereira closed 5 years ago
Is this still an issue for you?
We have almost identical issues running our system in docker. Our setup is:
1x Seed Node (Lighthouse)
2x App1 Applications (Identical Docker Images/Versions)
2x App2 Applications (Identical Docker Images/Versions)
2x App3 Applications (Identical Docker Images/Versions)
Sometimes all apps will connect and join the cluster correctly using the Lighthouse Seed Node. However, I then see issues where we may have 2x App1, 2x App2, and only 1x App3 applications joined to the cluster with no other useful information.
We have tried scaling back to just 1x instance of each of App1, App2 and App3. Sometimes they all connect fine, and other times only App1 and App3 join while App2 fails (it could be any of the apps that fails).
When we get this issue, I have checked that I can ping and telnet to my Seed Node from my failed App Container, and I can also ping/telnet back to my failed App Container from my Seed Node. So it doesn't appear to be a networking issue.
@jonhoare are you starting all of these nodes at the same time? If so, maybe just give seed node a headstart and let it get up before trying to join it?
@Horusiath Thanks for the heads up.
We have had our lighthouse application up and running for around 45 minutes now, and have just recycled the pods (we are using kubernetes) that were not connected. Some of these pods were able to connect successfully, but a couple of others still are not.
We get the following warning message in our application pod that is failing to join the Lighthouse seed node.
[15:03:43 WRN] Couldn't join seed nodes after [760] attempts, will try again. seed-nodes=[akka.tcp://myapp@myapp-lighthouse-0.myapp-lighthouse:4053]
Then in the Lighthouse we don't appear to see any warnings about a node failing to join/communicate.
Again, networking-wise, both pods can reach each other by IP and port, and by DNS name and port.
Is there anything else you can suggest trying, other than restarting a pod when it fails to communicate with the cluster?
I have the same issue when running inside docker. (Akka v1.3.9)
In my simple setup, if I run 1x seed node, my other services seem to connect just fine. When I introduce a second seed node, however, things become very unstable. As @carlosrpereira did, I can also, from Seed #2, both ping AND telnet to Seed #1, but Seed #2 will not connect to Seed #1:
[WARNING][09/28/2018 16:50:26][Thread 0010][[akka://Attraction/system/cluster/core/daemon/joinSeedNodeProcess-1#1499972846]] Couldn't join seed nodes after [191] attempts, will try again. seed-nodes=[akka.tcp://MySystem@172.23.0.2:4053]
So, again, pinging 172.23.0.2 from Seed #2 and telnetting to 172.23.0.2:4053 works from Seed #2, but it refuses to connect.
If I run the same processes not in docker, everything seems to work just fine 100% of the time.
I don't know if this helps, but I have an observation:
If I set the hostname to 0.0.0.0 inside the docker container and set the public-hostname to my host machine's IP, things seem to work much faster and connect for the most part. I need to explicitly set the cluster port for the Akka services for this to work, but I'll take working over not working any day =)
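For anyone who wants to try the same thing, the hostname / public-hostname observation above might look roughly like the HOCON below. This is only a sketch under assumptions: the IP 192.168.1.10 is a made-up placeholder for the Docker host's address, the system name MySystem and port 4053 are borrowed from the log line earlier in the thread, and the chosen port still has to be published from the container for other nodes to reach it.

```hocon
akka {
  actor.provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"

  remote.dot-netty.tcp {
    # Address the socket binds to inside the container (all interfaces).
    hostname = "0.0.0.0"

    # Address advertised to the rest of the cluster.
    # Placeholder value: replace with the Docker host's real IP or DNS name.
    public-hostname = "192.168.1.10"

    # Fixed port instead of 0 (random), so it can be published from the container.
    port = 4053
  }

  cluster {
    # Must use the same actor system name on every node, and the address
    # the seed node advertises (placeholder IP, as above).
    seed-nodes = ["akka.tcp://MySystem@192.168.1.10:4053"]
  }
}
```

The split matters because the address a node binds to and the address it advertises to other nodes are not necessarily the same inside a container.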
Following up to see if anyone has gotten Akka.NET to successfully run inside docker. If I run a demo setup outside of docker (1 Lighthouse, 3 Services) everything works great. When I run them all inside a single container, bound to localhost inside the container (cluster seed node akka.tcp://Cluster@localhost:4053), some services join the cluster and others do not.
Under this setup I can telnet to localhost:4053 and I connect, but I get the messages from the services saying they cannot reach the seed node. Again, all the processes are running inside the same container, so I would think there would not be any network issues.
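For reference, a service in the single-container setup described above would typically carry configuration along these lines. This is a hedged sketch, not the poster's actual config (which is not shown in the thread): the system name Cluster and the seed address come from the comment above, and everything else is an assumption.

```hocon
akka {
  actor.provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"

  remote.dot-netty.tcp {
    # Every process shares the container's loopback interface.
    hostname = "localhost"

    # 0 lets each service pick a free port, so several services can coexist
    # in one container. The seed node itself would use port = 4053 instead.
    port = 0
  }

  cluster {
    # The name after akka.tcp:// must match the name the actor system
    # was created with ("Cluster" here, taken from the address above).
    seed-nodes = ["akka.tcp://Cluster@localhost:4053"]
  }
}
```

If a config like this works outside docker but not inside it, that points away from the configuration itself.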
Ahhhh maybe this is it then:
https://github.com/akkadotnet/akka.net/issues/3506
I am on .NET Core 2.1, and that seems to be related; my issue only happens in docker though. Running on the microsoft/dotnet image.
@aventurella I do have this working in Docker as mentioned above, however we do notice that our apps occasionally fail to connect to lighthouse or they disconnect and are unable to reconnect again.
For us, we believe this is to do with an open bug with the dotnetty package used by Akka.net which causes this when running Akka.net using dotnetcore2.1+.
https://github.com/akkadotnet/akka.net/issues/3506
When we dropped our dotnetcore version to 2.0, the issue appeared to go away, and we are waiting for a fix that will allow us to move back to using 2.1+.
Not sure if this helps for you, but could be worth checking/trying?
Oh definitely, I'll gladly downgrade to give that a try. Where did you grab the .NET Core 2.0 image from? I'm looking here:
https://github.com/dotnet/dotnet-docker https://hub.docker.com/r/microsoft/dotnet/
But only see 2.1
I can just make my own image:
https://www.microsoft.com/net/download/linux-package-manager/debian9/runtime-2.0.9
I mean I am still using the petabridge lighthouse docker image, but I had to ensure my Akka projects were only using dotnet 2.0 packages and my docker image is using microsoft/dotnet:2.0.9-runtime.
Adding on to this, I seem to only experience the issue when running inside Docker. For example, I am successfully running:
1 Seed node
4 Services
bound to localhost, not inside docker, on CentOS 7, .NET Core 2.1, without an issue yet.
This is a duplicate of #3506 - it's the same issue. Network stack on .NET Core 2.1 introduced a regression on Linux. Working with the DotNetty team to get to the bottom of it.
> bound to localhost, not inside docker, on CentOS 7, .NET Core 2.1, without an issue yet.
That's an interesting piece of information... Quick question, does your CentOS environment default to IPv4 or IPv6? A possibility that @stormhub raised in the DotNetty chat room is that the issue could be related to DNS resolution on IPv6 not being fully supported on Linux. We've run into that issue in the past with Mono and even with bare metal work on Linux distributions.
Unless I'm reading this incorrectly (https://docs.docker.com/config/daemon/ipv6/), it looks to me like Docker defaults to IPv4, which should be fine...
According to #3506, this has been resolved via #3633
If you are still experiencing an issue, please upgrade to the latest version of Akka.NET.
We are facing a situation where a node is unable to join cluster. We have two seed nodes 10.0.0.4 and 10.0.0.5. The node 10.0.0.4 starts and works normally but node 10.0.0.5 does not join the cluster. Both servers are on the same LAN.
From node 10.0.0.5
Network connectivity is ok:
Testing port 10000 ...
Seed 10.0.0.4 is active and working with other nodes:
Akka configuration:
Versioning
Both 10.0.0.4 and 10.0.0.5 are running exactly the same docker container version.
Sometimes, after rebooting both machines, the nodes are able to join the cluster, but after a while the message starts appearing again.