akkadotnet / akka.net

Canonical actor model implementation for .NET with local + distributed actors in C# and F#.
http://getakka.net

Akka.Cluster: nodes stuck at joining #2584

Closed Aaronontheweb closed 7 years ago

Aaronontheweb commented 7 years ago

I've seen an issue reported on clusters with at least ~20 nodes where new members get stuck in the Joining stage. However, those new members are able to join when:

It looks to me like the issue, overall, is that there's an edge case where the leader stops responding to some part of the join operation and gets stuck. I've been looking into this for a little while, but I have more data now and thought I'd open an issue for other users who've observed this happening.

nvivo commented 7 years ago

I'm experiencing the same issue. A basic test so far showed that up to 15 nodes, everything came up correctly. The 16th got stuck at JOINING. Once I started another node, the previously stuck one went UP and the new one got stuck at JOINING.

nvivo commented 7 years ago

@Aaronontheweb Is there any dev build with this fixed, maybe a 1.2.1 preview? I'd be glad to run it here to see if this is fixed.

oeaoaueaa commented 7 years ago

As discussed previously, this bug is stopping us from enabling Akka.NET usage in production, so we would be very happy to test the fix in our environments with 25-50 nodes.

nvivo commented 7 years ago

I'm waiting patiently for 1.3, but for now I've decided to stop relying on Akka cluster for communication between nodes for critical work, as waiting forever for nodes to join was causing us problems. I'm currently back to the database for work synchronization until the cluster is more reliable. I'm able to test pre-releases, but I guess 1.3 will be a big one, so I'll need to check for other impacts before trying it.

Aaronontheweb commented 7 years ago

@oeaoaueaa @nvivo my apologies guys, I've been working on this issue but it's been very difficult to reproduce. If you have any additional logs you can send my way that include any information about gossip conflicts, that would be super helpful, as that's where I've been looking and testing thus far.

nvivo commented 7 years ago

@Aaronontheweb I'm not sure what to provide, I'm not seeing any logs related to Akka problems. Is there a way to enable debug logs for this without enabling heartbeat logs all the time?

Aaronontheweb commented 7 years ago

@nvivo one bit of data that might be helpful: if you have Petabridge.Cmd or Akka.Cluster.Monitor, can you send me a readout of what the cluster looks like, as seen from another node that's already in the cluster, when the issue happens?

What I see happening is that the newer gossip produced by the leader, marking the node as moving from Joining --> Up, is overwritten by gossip from a non-leader node (which indicates a consistency problem with how Gossip versions are checked), but I still haven't been able to reproduce that behavior yet. I've written a couple of heavy-duty model-based tests to try to crack it and have verified that the VectorClock and MemberOrdering stuff is fine. I have one I'm working on for the Gossip class itself, because I think the issue there is that the leader itself may not be incrementing the version of its gossip correctly under all conditions.

The reason why I think having a new node attempt to join resolves the issue for the previously stuck node is that it forces a gossip update that correctly bumps the version of the Gossip indicating that the previously stuck node is up, but not the next one.
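
For intuition, here's a stripped-down model of the version check (a simplified sketch, not the actual Akka.NET VectorClock implementation; all names here are hypothetical): one gossip only wins outright if its clock dominates the other's on every node. If neither dominates, the two gossips are concurrent and have to be merged member-by-member, and a merge that keeps the wrong incarnation is exactly where an Up can silently fall back to Joining.

using System;
using System.Collections.Generic;
using System.Linq;

static class VectorClockSketch
{
    // Simplified vector clock: one counter per node. Clock A dominates B only if every
    // counter in B is matched or exceeded by A, and A is strictly ahead somewhere.
    static bool Dominates(IReadOnlyDictionary<string, long> a, IReadOnlyDictionary<string, long> b) =>
        b.All(kv => a.TryGetValue(kv.Key, out var v) && v >= kv.Value) &&
        a.Any(kv => !b.TryGetValue(kv.Key, out var v) || kv.Value > v);

    static void Main()
    {
        var leaderGossip    = new Dictionary<string, long> { ["leader"] = 3, ["nodeB"] = 1 };
        var nonLeaderGossip = new Dictionary<string, long> { ["leader"] = 2, ["nodeB"] = 2 };

        Console.WriteLine(Dominates(leaderGossip, nonLeaderGossip)); // False
        Console.WriteLine(Dominates(nonLeaderGossip, leaderGossip)); // False
        // Neither dominates: the gossips are concurrent, so the receiver has to merge
        // them member-by-member, and a bad merge can keep the stale Joining incarnation.
    }
}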

nvivo commented 7 years ago

Here is what I get with Petabridge.Cmd:

[Unspecified/localhost:10150] pbm> cluster status
akka.tcp://SICluster@10.0.0.50:10000 | [seed] | up |
[Unspecified/localhost:10150] pbm> cluster show
akka.tcp://SICluster@10.0.0.15:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.16:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.20:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.23:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.25:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.27:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.30:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.39:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.40:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.45:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.47:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.50:10000 | [seed] | up |
akka.tcp://SICluster@10.0.0.50:10006 | [mgmt-api] | joining |
akka.tcp://SICluster@10.0.0.50:10007 | [website-api] | up |
akka.tcp://SICluster@10.0.0.50:10050 | [cache] | up |
akka.tcp://SICluster@10.0.0.50:10051 | [cache-updater] | up |
akka.tcp://SICluster@10.0.0.50:10052 | [backend] | up |
akka.tcp://SICluster@10.0.0.51:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.59:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.62:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.7:10050 | [accountservices] | up |

[Unspecified/localhost:10155] pbm> cluster status
akka.tcp://SICluster@10.0.0.50:10006 | [mgmt-api] | joining |
[Unspecified/localhost:10155] pbm> cluster show
akka.tcp://SICluster@10.0.0.15:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.16:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.20:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.23:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.25:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.27:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.30:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.39:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.40:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.45:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.47:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.50:10000 | [seed] | up |
akka.tcp://SICluster@10.0.0.50:10006 | [mgmt-api] | joining |
akka.tcp://SICluster@10.0.0.50:10007 | [website-api] | up |
akka.tcp://SICluster@10.0.0.50:10050 | [cache] | up |
akka.tcp://SICluster@10.0.0.50:10051 | [cache-updater] | up |
akka.tcp://SICluster@10.0.0.50:10052 | [backend] | up |
akka.tcp://SICluster@10.0.0.51:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.59:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.62:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.7:10050 | [accountservices] | up |

At this time, it's this mgmt-api node that isn't joining, but sometimes all the accountservices nodes are unable to join.

What I have been doing to solve it is to bring the entire cluster down and up again, including the seed, starting fresh. Then everything comes up correctly. This may explain why this may be difficult to reproduce: it seems to work fine if the cluster has been started recently.

From my experience, I have these problems if I restart a node after updating the app (only app code, not Akka versions), or if the AWS network does its thing and becomes unavailable at random moments. If the cluster has been started recently (like, in the last few minutes) I can get all nodes to come up correctly. If the cluster has been running for a day or more, this issue is more likely to happen.

Now, it doesn't happen every time. Sometimes it just works even if the cluster has been running for a few days. I'm inclined to believe this is somehow related to networking issues, such as connections getting dropped unexpectedly (AWS gives me lots of those), and to the number of cluster convergences, which would explain why it doesn't happen when the cluster has recently started.

This may be completely unrelated, but I get lots of these messages during AWS outages:

System.Net.Sockets.SocketException (0x80004005): An existing connection was forcibly closed by the remote host
   at DotNetty.Transport.Channels.Sockets.SocketChannelAsyncOperation.Validate()
   at DotNetty.Transport.Channels.Sockets.AbstractSocketByteChannel.SocketByteChannelUnsafe.FinishRead(SocketChannelAsyncOperation operation)

And this happens multiple times a day, every day for all the nodes.

Aaronontheweb commented 7 years ago

I get lots of these messages during AWS outages

What do you mean by "AWS outage" exactly?

nvivo commented 7 years ago

Sorry, poor choice of words. I'm getting lots of connections closing unexpectedly over long periods (from Friday night to Monday morning), then it suddenly stops and everything is fine for the rest of the week. This affects all my VMs, not only for Akka but for databases, HTTP requests, etc. I'm talking to AWS support.

But I believe my problems with akka are getting amplified by these issues.

oeaoaueaa commented 7 years ago

Not sure how useful this would be:

[127.0.0.1:9110] pbm> cluster show
akka.tcp://pricecustomization@cuiapi201.cuidomain.com:33303 | [] | up |
akka.tcp://pricecustomization@cuiapi202.cuidomain.com:33303 | [] | up |
akka.tcp://pricecustomization@cuiapp201.cuidomain.com:33301 | [lighthouse] | up |
akka.tcp://pricecustomization@cuiapp201.cuidomain.com:33302 | [price-customizer] | up |
akka.tcp://pricecustomization@cuiapp201.cuidomain.com:33303 | [] | up |
akka.tcp://pricecustomization@cuiapp202.cuidomain.com:33301 | [lighthouse] | up |
akka.tcp://pricecustomization@cuiapp202.cuidomain.com:33302 | [price-customizer] | up |
akka.tcp://pricecustomization@cuiapp202.cuidomain.com:33303 | [] | up |
akka.tcp://pricecustomization@cuiapp203.cuidomain.com:33301 | [lighthouse] | up |
akka.tcp://pricecustomization@cuiapp203.cuidomain.com:33302 | [price-customizer] | up |
akka.tcp://pricecustomization@cuiapp203.cuidomain.com:33303 | [] | up |
akka.tcp://pricecustomization@cuiapp204.cuidomain.com:33302 | [price-customizer] | joining |
akka.tcp://pricecustomization@cuiapp204.cuidomain.com:33303 | [] | up |
akka.tcp://pricecustomization@cuiapp205.cuidomain.com:33302 | [price-customizer] | up |
akka.tcp://pricecustomization@cuiapp205.cuidomain.com:33303 | [] | up |
akka.tcp://pricecustomization@cuiapp206.cuidomain.com:33302 | [price-customizer] | up |
akka.tcp://pricecustomization@cuiapp206.cuidomain.com:33303 | [] | up |
akka.tcp://pricecustomization@cuiapp207.cuidomain.com:33302 | [price-customizer] | up |
akka.tcp://pricecustomization@cuiapp207.cuidomain.com:33303 | [] | up |
akka.tcp://pricecustomization@cuiapp208.cuidomain.com:33302 | [price-customizer] | up |
akka.tcp://pricecustomization@cuiapp208.cuidomain.com:33303 | [] | up |
akka.tcp://pricecustomization@localhost:14321 | [petabridge.cmd] | up |

Sometimes one of the nodes of the cluster gets stuck joining; the only way of recovering from that is to stop the node that is the leader and then start it again after a while to let the leave event propagate. This works in our CUI environment with <25 nodes, but production is close to 50 nodes and there we cannot stop a node for that long. Let me know if you need more information.

Aaronontheweb commented 7 years ago

Alright, I have a reproduction scenario working on my local computer: https://github.com/akkadotnet/akka.net/issues/2015 - the anecdotal data @nvivo gave me clued me in big time; it has more to do with when a network issue rears its head and affects the leader than anything else. I'm going to be out of office for the next few days, but once I'm settled I expect to get a fix in for this next week in a release separate from 1.3... 1.2.1

crucifieddreams commented 7 years ago

Just to add to this discussion based on our experience. I have never seen this issue running a cluster on a single machine. To reproduce it I have always had to run the cluster across two or more machines, and the lowest number of nodes I've reproduced it with is 6 running over 2 machines (which is interesting given that Aaron has suggested network instability). Once we have the cluster over 3 machines we can reliably reproduce the issue.

In our production environment where we have many servers the issue is always present but manageable for now.

In our experience the nodes stuck at JOINING are still able to send and receive messages on the cluster most of the time (although when the node whose gossip is going wrong and preventing the leader from bringing the stuck node up itself depends on the stuck node [i.e. communicates with it for some business process], communication obviously fails). When they can't, we take time to attempt to get them to join; identifying this is a very manual task at the moment.

Restarting the leader works sometimes (generally, if we see multiple nodes stuck at joining for a single server, a leader restart is the solution; the leader is generally on a different server when this happens). If we have a single node stuck, we can sometimes force the gossip to update by joining a dummy node to the cluster; the stuck node gets brought up, and we can then kill off the dummy node. In other scenarios we change the IP/port config in HOCON to make the stuck node appear to the cluster as a different node.

Aaronontheweb commented 7 years ago

See #2773 for fix in progress

alexvaluyskiy commented 7 years ago

Done https://github.com/akkadotnet/akka.net/pull/2773

nvivo commented 7 years ago

@Aaronontheweb sorry to report, but I updated the entire cluster with the new version; it's been 15 minutes already and I'm still getting a node stuck joining:

akka.tcp://SICluster@10.0.0.10:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.14:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.15:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.16:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.20:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.23:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.25:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.27:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.30:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.39:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.40:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.45:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.47:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.50:10000 | [seed] | up |
akka.tcp://SICluster@10.0.0.50:10006 | [mgmt-api] | up |
akka.tcp://SICluster@10.0.0.50:10007 | [website-api] | up |
akka.tcp://SICluster@10.0.0.50:10050 | [backend] | up |
akka.tcp://SICluster@10.0.0.50:10051 | [cache] | up |
akka.tcp://SICluster@10.0.0.50:10052 | [cache-updater] | up |
akka.tcp://SICluster@10.0.0.51:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.54:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.55:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.59:10050 | [accountservices] | joining |
akka.tcp://SICluster@10.0.0.60:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.62:10050 | [accountservices] | up |
akka.tcp://SICluster@10.0.0.7:10050 | [accountservices] | up |

Tried to restart, and I get:

[image: leader log output]

Aaronontheweb commented 7 years ago

@nvivo when you rebooted the node, what happened afterwards? was the previous incarnation downed and removed correctly? or was it still stuck at joining?

nvivo commented 7 years ago

I restarted the entire cluster from scratch, and let the nodes join. This node specifically got stuck at joining. I restarted the node process, and first, it got stuck leaving, and became "leaving | unreachable" in pbm.

Then, from the logs in the image, the leader removed the node successfully when the new one tried to join. I saw it removed in pbm as well.

I'm adding the debug flags and I'll post the logs here.

nvivo commented 7 years ago

Turned out one of the 20 nodes (the one stuck joining) had the wrong image for some reason and wasn't running 1.2.1. After upgrading it manually, it joined the cluster.

I'll keep monitoring and let you know if something else comes up.

Thanks!

Aaronontheweb commented 7 years ago

Looks like this is not a done deal yet; received new updates from @nvivo that indicate there's still a problem.

Aaronontheweb commented 7 years ago

@nvivo @oeaoaueaa @crucifieddreams ok.... this was a pain in the ass to find but the logs you sent in helped me.

I'm going to walk you through my solution and my reproduction, which helps demonstrate why this was so nasty to isolate.

Here's the offending line of code that creates the problem: https://github.com/akkadotnet/akka.net/blob/v1.3/src/core/Akka.Cluster/ClusterDaemon.cs#L2244

Here's the problem: when we log this message

"Leader is moving node [some node] to [Up]"

We're evaluating only the contents of the changedMembers collection, an ImmutableSortedSet<Member>. This will be important in a moment.

https://github.com/akkadotnet/akka.net/blob/v1.3/src/core/Akka.Cluster/ClusterDaemon.cs#L2285

In the previous line I linked, we call changedMembers.Union(localMembers) - when a node joins the cluster and gets marked as MemberStatus.Up it has one incarnation of its Member object contained inside the localMembers collection with status Joining and another in the changedMembers collection with status Up.

All Members do equality by value based solely on their UniqueAddress, so these two members will evaluate as Equal to each other by design:

var address1 = new Address("akka.tcp", "sys1", "host1", 9001);
var m1 = Member.Create(new UniqueAddress(address1, 1), 0, MemberStatus.Joining, ImmutableHashSet<string>.Empty);
var m2 = Member.Create(new UniqueAddress(address1, 1), 0, MemberStatus.Up, ImmutableHashSet<string>.Empty);

m1.Equals(m2); // will evaluate to true

So the implication is that if I stick m1 and m2 into the same ISet<Member> collection, only one of them will be included in the set, because they evaluate as equal to each other by design.
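
As a minimal, self-contained illustration of that set behavior (using a hypothetical MemberStub type rather than the real Member class), address-only equality means the set silently keeps whichever incarnation it saw first and drops the other:

using System;
using System.Collections.Immutable;
using System.Linq;

// Hypothetical stand-in for Member: equality (and the hash code) is by Address only,
// mirroring the UniqueAddress-based equality described above; Status is ignored.
sealed class MemberStub : IEquatable<MemberStub>
{
    public string Address { get; }
    public string Status { get; }
    public MemberStub(string address, string status) { Address = address; Status = status; }

    public bool Equals(MemberStub other) => other != null && Address == other.Address;
    public override bool Equals(object obj) => Equals(obj as MemberStub);
    public override int GetHashCode() => Address.GetHashCode();
}

static class SetDemo
{
    static void Main()
    {
        var joining = new MemberStub("akka.tcp://sys1@host1:9001", "Joining");
        var up      = new MemberStub("akka.tcp://sys1@host1:9001", "Up");

        // Both elements are "equal", so only the first one added survives.
        var set = ImmutableHashSet.Create(joining, up);
        Console.WriteLine(set.Count);           // 1
        Console.WriteLine(set.Single().Status); // "Joining" - the Up incarnation was discarded
    }
}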

Well, bearing that in mind, let's look at the offending line of code:

https://github.com/akkadotnet/akka.net/blob/v1.3/src/core/Akka.Cluster/ClusterDaemon.cs#L2244

// replace changed members
var newMembers = changedMembers
    .Union(localMembers)
    .Except(removedUnreachable)
    .Where(x => !removedExitingConfirmed.Contains(x.UniqueAddress))
    .ToImmutableSortedSet();

We only get the m2 member in the final newMembers object if the ImmutableSortedSet<Member>.Union method filters out the duplicate items from the localMembers collection. In other words, it has to keep everything in the changedMembers set and overwrite the older versions of those Member objects from the localMembers collection.

Based on the data I've collected, this usually happens, but not always. The stuck-at-joining/leaving bug is caused by the edge cases where the copies from changedMembers get discarded instead of the copies from localMembers. Those edge cases almost always occur when the localMembers collection is much larger than the changedMembers collection, and here's why:

https://github.com/dotnet/corefx/blob/master/src/System.Collections.Immutable/src/System/Collections/Immutable/ImmutableSortedSet_1.cs#L324 - if the collection being passed into the union method (right-hand) is larger than the collection on the left, the operation is flipped around and the smaller collection gets processed last instead of first.

The placement algorithm used by ImmutableSortedSet also changes depending on the number of items added above a certain threshold: https://github.com/dotnet/corefx/blob/master/src/System.Collections.Immutable/src/System/Collections/Immutable/ImmutableSortedSet_1.cs#L329 - one of these two algorithms might favor the left over the right in some conditions.
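
A quick way to see that sensitivity for yourself (a hypothetical probe, not one of our tests): build two sorted sets whose elements compare equal by key but carry different payloads, union them, and print which payload survives for the shared key. On the corefx implementation linked above, the survivor can change with the relative sizes of the two sets and the direction of the call.

using System;
using System.Collections.Immutable;
using System.Linq;

// Elements compare (and therefore de-duplicate) by Key only; Payload is what we risk losing.
sealed class Entry : IComparable<Entry>
{
    public int Key { get; }
    public string Payload { get; }
    public Entry(int key, string payload) { Key = key; Payload = payload; }
    public int CompareTo(Entry other) => Key.CompareTo(other.Key);
}

static class UnionProbe
{
    static void Main()
    {
        // "changed" holds the updated payload for key 0; "local" holds a stale copy plus many other keys.
        var changed = ImmutableSortedSet.Create(new Entry(0, "Up"));
        var local = ImmutableSortedSet.CreateRange(
            Enumerable.Range(0, 25).Select(i => new Entry(i, i == 0 ? "Joining" : "Up")));

        var merged = changed.Union(local);
        Console.WriteLine(merged.First(e => e.Key == 0).Payload); // which copy of key 0 survived?
    }
}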

The reason why this never showed up in any of our tests was that our test clusters weren't large enough to reproduce the error, plus one other thing: the interval at which nodes are added to the cluster needs to have some variation.

If I launched a 20 node cluster all at once, it'd work fine - because the collection on the left-hand side of the operation containing all of the modified nodes was always larger. In order for this problem to occur, one of the joining nodes would have to be 5-10 seconds behind the other ~19 or so nodes - at least 1 or 2 LeaderActionsTick need to fire in order for the other nodes to make it into the localMembers collection on the next pass. I was able to verify this by throwing a breakpoint into the middle of a 22 node cluster, forcing the leader to wait a couple of ticks, and then releasing it. And the issue was that the MemberStatus.Joining version of the Member was what was being passed into the gossip as part of the newMembers collection.

The fix should be as simple as cherry-picking the items from the left-hand side of the operation and then appending the missing ones from the right. That should do the trick.
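
A minimal sketch of that kind of left-biased union (just the idea, not the actual fix in the PR): keep every element from the left-hand set and append only the right-hand elements that have no counterpart on the left. With the address-based comparison described above, the changedMembers incarnation always wins.

using System.Collections.Immutable;

static class MergeSketch
{
    // Sketch only: elements that compare equal keep the left-hand copy; everything
    // unique to the right-hand set is appended afterwards.
    public static ImmutableSortedSet<T> LeftBiasedUnion<T>(
        ImmutableSortedSet<T> left, ImmutableSortedSet<T> right)
    {
        var builder = left.ToBuilder();
        foreach (var item in right)
        {
            if (!left.Contains(item)) // Contains uses the set's comparer, i.e. address-only for members
                builder.Add(item);
        }
        return builder.ToImmutable();
    }
}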

I was also able to verify the instability of the collection itself with a spec earlier:

[Fact]
public void MemberOrdering_must_work_with_set_union()
{
    var address1 = new Address("akka.tcp", "sys1", "host1", 9001);
    var address2 = address1.WithPort(9002);
    var address3 = address1.WithPort(9003);

    var s1 = ImmutableSortedSet
        .Create(TestMember.Create(address1, MemberStatus.Joining));
        //.Add(TestMember.Create(address2, MemberStatus.Up, ImmutableHashSet<string>.Empty, upNumber: 1));

    var s2 = ImmutableSortedSet.Create(TestMember.Create(address1, MemberStatus.Up, ImmutableHashSet<string>.Empty, upNumber: 2));

    var s3 = ImmutableSortedSet
        .Create(TestMember.Create(address1, MemberStatus.Up));
        //.Add(TestMember.Create(address2, MemberStatus.Up));

    var u1 = s2.Union(s1);
    u1.Should().BeEquivalentTo(s3);
    u1.Single(x => x.Address.Equals(address1)).Status.Should().Be(MemberStatus.Up);

    var s4 = ImmutableSortedSet
        .Create(TestMember.Create(address1, MemberStatus.Up))
        .Add(TestMember.Create(address2, MemberStatus.Up))
        .Add(TestMember.Create(address3, MemberStatus.Joining));

    var s5 = ImmutableSortedSet
        .Create(TestMember.Create(address1, MemberStatus.Up))
        .Add(TestMember.Create(address2, MemberStatus.Up))
        .Add(TestMember.Create(address3, MemberStatus.Up));

    var u2 = s4.Union(s1);
    u2.Should().BeEquivalentTo(s5);
    u2.Single(x => x.Address.Equals(address1)).Status.Should().Be(MemberStatus.Up);
}

If I uncomment those earlier lines, the spec will fail reliably. If I comment them back out, the spec will pass. This is also why you may have noticed that the stuck node would get marked as Up if a second node attempted to join - it may have tipped the Union algorithm to pick the left-hand copy of the member instead of the right.

Anyway, I'll leave this explanation up overnight for feedback from all of you before I push a fix and some tests. Would love to hear your thoughts.

crucifieddreams commented 7 years ago

Hey Aaron,

That does sound like a major pain to track down :)

This statement

"If I launched a 20 node cluster all at once, it'd work fine - because the collection on the left-hand side of the operation containing all of the modified nodes was always larger. In order for this problem to occur, one of the joining nodes would have to be 5-10 seconds behind the other ~19 or so nodes - at least 1 or 2 LeaderActionsTick need to fire in order for the other nodes to make it into the localMembers collection on the next pass. "

is exactly the behaviour we see. One of our developers wrote a batch script to launch the cluster very quickly (once all the nodes have been installed), and this proved much more reliable (though not 100%) in getting all the nodes up and stable. With our normal deployment to an environment, each node takes a while to come up as it gets uninstalled and then reinstalled, and we always get one of them stuck JOINING.

Another point: yesterday evening around 20:30 someone must have hit one of our web apps in the test environment, which caused it to join the test cluster I sent you logs from. As this node came UP, it caused the one stuck at JOINING to also be moved UP.

25 Jun 2017 20:28:33.667 Leader is moving node [akka.tcp://7im-SI@xxxx:53506] to [Up] (Previously stuck node)
25 Jun 2017 20:28:28.589 Leader is moving node [akka.tcp://7im-SI@xxxx:53513] to [Up] (WebApp)

So when I checked this morning everything looked fine.

oeaoaueaa commented 7 years ago

Hi Aaron, this explanation fits the issue we are experiencing perfectly; in our case we deploy our services in batches of four instead of all at once. Thank you for investigating this very tricky issue - ImmutableSortedSet's optimized/dual behavior made this bug even more difficult to track down and solve.

nvivo commented 7 years ago

Makes sense for me as well. I recently changed the way I start the nodes from all at once to 2 at a time, and I have been able to reproduce this error much more frequently. Let's hope this is the last one!

Aaronontheweb commented 7 years ago

BTW, what gave this away as the issue was the enhanced logs you all sent me: they all showed the leader repeatedly incrementing its vector clock (good), but the node still being marked as joining instead of up (bad). That helped me isolate it to the leader itself, rather than any of the other receiving nodes doing something weird on merge. I also fixed one other cluster bug in the process of looking for this, where we weren't incrementing the upNumber correctly.

So thanks for being great users and giving us really detailed information - it made the difference.

Aaronontheweb commented 7 years ago

PR is in along with detailed comments - https://github.com/akkadotnet/akka.net/pull/2794

Please weigh in there on the changes; test suite will take a bit to run in the meantime.

crucifieddreams commented 7 years ago

After running a few tests of the nightly build (1.2.1.402-beta) in one of our test environments I have been unable to reproduce this issue.

Looks like the PR Aaron has submitted has fixed it. Cracking job Aaron! I'd recommend anyone seeing this issue pick up the nightly build.

Thanks again for the effort involved in getting to the bottom of this.

Aaronontheweb commented 7 years ago

Nightlies can be found here: http://getakka.net/docs/akka-developers/nightly-builds

We'll be shipping an official 1.2.2 ASAP, which fixes this issue and some other bugs we found in the process. Thanks for all of your patience in bearing with us. This one was nasty.

nvivo commented 7 years ago

Just for the record: we've been running 1.2.2 since yesterday with no more issues. Thanks!

oeaoaueaa commented 7 years ago

Same here; we are running 1.2.3 with 22 nodes in our pre-prod environment and 53 in production today, without any issue so far.

Thank you!