akkadotnet / akka.net

Canonical actor model implementation for .NET with local + distributed actors in C# and F#.
http://getakka.net

Akka.Cluster.Sharding `ShardRegion` - `DurableData` bottleneck #5190

Closed. Aaronontheweb closed this issue 1 year ago.

Aaronontheweb commented 3 years ago

Version Information
Version of Akka.NET? v1.4.23 (also reproduced with v1.4.22)
Which Akka.NET Modules? Akka.Cluster.Sharding + DData

Occurs on Linux and Windows

Describe the performance issue
Using a reproduction sample I created while testing our solution for https://github.com/akkadotnet/akka.net/issues/5174, I used the following configuration and code:

akka {
  actor {
    provider = cluster
  }

  remote {
    dot-netty.tcp {
      public-hostname = "localhost"
      hostname = "0.0.0.0"
      port = 4051
    }
  }            

  cluster {
    downing-provider-class = "Akka.Cluster.SplitBrainResolver, Akka.Cluster"
    split-brain-resolver {
      active-strategy = keep-majority
    }

    sharding {
      state-store-mode = ddata
      remember-entities = on
    }

    seed-nodes = [] 
    roles = []
  }
}

// Start the "entity" shard region using the EntityRouter message extractor
var sharding = ClusterSharding.Get(ClusterSystem);
var shardRegion = sharding.Start("entity", s => Props.Create<EntityActor>(s),
    ClusterShardingSettings.Create(ClusterSystem),
    new EntityRouter(100));

// Form a single-node cluster by joining our own address
var cluster = Cluster.Get(ClusterSystem);
cluster.Join(cluster.SelfAddress);

// Once the node is Up, send 25 messages to random entity ids every 100ms -
// roughly 250 msg/s, nearly all of them targeting brand-new (remembered) entities
Cluster.Get(ClusterSystem).RegisterOnMemberUp(() =>
{
    ClusterSystem.Scheduler.Advanced.ScheduleRepeatedly(TimeSpan.FromMilliseconds(100),
        TimeSpan.FromMilliseconds(100),
        () =>
        {
            for (var i = 0; i < 25; i++)
            {
                shardRegion.Tell(new EntityCmd(ThreadLocalRandom.Current.Next().ToString()));
            }
        });
});

The sharding configuration is vanilla: a basic remember-entities + ddata setup without any other frills.
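
The EntityCmd, EntityActor, and EntityRouter types come from the linked demo repository and aren't reproduced in the issue. A minimal sketch of what they plausibly look like, assuming EntityRouter is a HashCodeMessageExtractor that spreads entities across the 100 shards passed to its constructor:

using Akka.Actor;
using Akka.Cluster.Sharding;

// Hypothetical reconstructions of the demo types referenced above; the real
// definitions live in the Akka.Cluster.Sharding.DDataDemo repository.
public sealed class EntityCmd
{
    public EntityCmd(string entityId) => EntityId = entityId;
    public string EntityId { get; }
}

// Maps each EntityCmd to an entity id; the base class hashes that id
// into one of maxNumberOfShards shards (100 in the repro code above).
public sealed class EntityRouter : HashCodeMessageExtractor
{
    public EntityRouter(int maxNumberOfShards) : base(maxNumberOfShards) { }

    public override string EntityId(object message)
        => message is EntityCmd cmd ? cmd.EntityId : null;
}

// A do-nothing entity is enough to reproduce the allocation bottleneck.
public sealed class EntityActor : ReceiveActor
{
    public EntityActor(string entityId)
    {
        ReceiveAny(_ => { });
    }
}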

Data and Specs

Within a few seconds of starting up the solution with this configuration, I began receiving messages along the lines of:

[WARNING][8/10/2021 2:58:19 AM][Thread 0012][akka.tcp://ClusterSys@desktop-13cpqtr:4051/system/sharding/entity] entity: Requested shard homes [0, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 22, 23, 24, 25, 26, 27, 28, 29, 3, 31, 34, 37, 39, 4, 40, 41, 42, 43, 44, 45, 46, 47, 50, 56, 57, 59, 61, 62, 63, 64, 65, 68, 69, 7, 70, 71, 74, 75, 76, 77, 79, 8, 82, 84, 85, 86, 87, 88, 89, 9, 92, 93, 95, 96, 97, 98, 99] from coordinator at [[akka://ClusterSys/system/sharding/entityCoordinator/singleton/coordinator#208095769]]. [2790] total buffered messages.

The issue here appears to be that the messages we're attempting to route to the entity actors starve out the messages needed to allocate the shards where those entity actors will live. That's a problem.
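
One way to watch this starvation happen (not from the original report, just a diagnostic sketch using the standard shard-region query messages) is to periodically ask the ShardRegion which shards it has actually allocated; if routing traffic is starving allocation, this set stays small while the buffered-message count in the warnings keeps climbing:

using System;
using System.Threading.Tasks;
using Akka.Actor;
using Akka.Cluster.Sharding;

// Ask the local ShardRegion for the shards (and entities) it currently hosts.
static async Task DumpShardState(IActorRef shardRegion)
{
    var state = await shardRegion.Ask<CurrentShardRegionState>(
        GetShardRegionState.Instance, TimeSpan.FromSeconds(3));
    Console.WriteLine($"Allocated shards: {state.Shards.Count}");
}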

This issue does not occur when remember-entities is switched off, i.e. when sharding never touches the durable store.

This leads me to believe that the performance issue here is likely caused by how we interact with the DurableStore when using DData mode.
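
For reference, the knobs that govern that interaction live under the sharding replicator's durable-store settings. The sketch below follows my reading of the Akka.NET reference config (key names are worth verifying against your installed version); the "shard-*" keys are the remember-entities records that get written through to LMDB:

akka.cluster.sharding.distributed-data.durable {
  # which replicated keys are written through the durable store;
  # remember-entities data lands here when state-store-mode = ddata
  keys = ["shard-*"]
  lmdb {
    # where the LMDB durable store keeps its files
    dir = "ddata"
    # "off" (the default) writes through on every update; an interval
    # batches writes and can relieve pressure on this path
    write-behind-interval = off
  }
}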

You can see the full demo here: https://github.com/Aaronontheweb/Akka.Cluster.Sharding.DDataDemo

Expected behavior
I'd expect the ShardRegion to be able to support the creation of thousands of entities per second, particularly at node startup when remember-entities = on, and it should still be able to allocate shards while doing that AND processing new messages intended for those entities.

Actual behavior
The system locked up and the buffer perpetually filled up.

Environment
.NET Core 3.1, Windows 10 (bare metal)
.NET Core 3.1, Ubuntu 20.04 (WSL2)

Aaronontheweb commented 1 year ago

Mostly resolved via the changes introduced in Akka.NET v1.5.
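
For anyone landing here later: v1.5 reworked how remember-entities state is stored. A sketch of the v1.5-era configuration that routes remember-entities bookkeeping through persistence instead of the DData durable store (setting names per my understanding of the 1.5 reference config; verify against your version):

akka.cluster.sharding {
  state-store-mode = ddata
  remember-entities = on
  # store remembered-entity records via event sourcing rather than the
  # replicated durable store that bottlenecked in this issue
  remember-entities-store = eventsourced
}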