akkadotnet / akka.net

Canonical actor model implementation for .NET with local + distributed actors in C# and F#.
http://getakka.net

Akka.Cluster.Tools.Singleton: singleton moves earlier than expected - as soon as new node joins #7196

Closed by Aaronontheweb 2 months ago

Aaronontheweb commented 4 months ago

Version Information

Version of Akka.NET? v1.5.0
Which Akka.NET Modules? Akka.Cluster.Tools

Describe the bug

Chasing down an issue for a production support customer - they have a custom pbm command for tracking the location of cluster singletons. They confirmed the singleton was on a specific node and decided to replace that node last during a version upgrade. What they observed was: the singleton moved onto the newest node with the highest AppVersion even before that oldest node was downed!

Expected behavior

As I wrote back to the customer originally, the singleton should only move onto a new node AFTER the node it's currently on begins to leave the cluster. This leads me to believe that the following code might have a bug in how we compute the sort order that determines the most suitable node for a singleton:

https://github.com/akkadotnet/akka.net/blob/3f0be58a661150c3d14572cd4615b526ba5e037a/src/contrib/cluster/Akka.Cluster.Tools/Singleton/OldestChangedBuffer.cs#L98-L112

In fact, I'm almost certain that this is the case.
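To make the suspected failure mode concrete, here is a minimal, language-neutral sketch in Python. It is not the code in OldestChangedBuffer.cs; `Member`, `up_number`, and `pick_singleton_host` are hypothetical names invented for illustration, and the ordering rule shown (highest app version first, age as tie-breaker) is an assumption about what such a sort could look like. It only shows why a version-first ordering would change the "oldest" node the moment a newer node joins, while an age-only ordering would not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    """Hypothetical stand-in for a cluster member (not the Akka.NET type)."""
    name: str
    up_number: int      # lower means the node joined earlier, i.e. it is older
    app_version: tuple  # e.g. (1, 0, 0)


def pick_singleton_host(members, consider_app_version=False):
    """Return the member that would be treated as the singleton's home."""
    if consider_app_version:
        # Assumed rule: highest app version wins, age only breaks ties.
        def key(m):
            return (tuple(-part for part in m.app_version), m.up_number)
    else:
        # Age-only rule: the oldest member always wins.
        def key(m):
            return m.up_number
    return min(members, key=key)


old_nodes = [Member("node-a", 1, (1, 0, 0)), Member("node-b", 2, (1, 0, 0))]
new_node = Member("node-c", 3, (1, 1, 0))  # just joined, higher app version

before = pick_singleton_host(old_nodes, consider_app_version=True)
after = pick_singleton_host(old_nodes + [new_node], consider_app_version=True)
after_age_only = pick_singleton_host(old_nodes + [new_node])

print(before.name, after.name, after_age_only.name)
```

Under these assumptions the version-first ordering selects node-c as soon as it joins, even though node-a is still up, which matches the early move described above. The age-only ordering keeps the singleton on node-a until that node leaves.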

Aaronontheweb commented 4 months ago

Marking this bug as critical - one of the major side effects from this issue is that we can create split brains with all cluster singletons during deployments when the AppVersion is getting bumped. That can result in problems such as #6973

Aaronontheweb commented 4 months ago

So this bug likely affected fewer people than I initially thought, because the setting at

https://github.com/akkadotnet/akka.net/blob/d1ed226e8b140215427bbd8ffd58130662d7ff28/src/contrib/cluster/Akka.Cluster.Tools/Singleton/reference.conf#L49

has been set to false this whole time, and false is also the default value the HOCON extractors use when this configuration isn't available. That's good news, but it still needed to be fixed.

Aaronontheweb commented 4 months ago

Looks like the original issue reported by the end user wasn't even caused by the AppVersion, but this feature is definitely a footgun and probably needs to be removed.