akkadotnet / akka.net

Canonical actor model implementation for .NET with local + distributed actors in C# and F#.
http://getakka.net

[PERF] Cluster Sharding perf issue #5203

Open carl-camilleri-uom opened 3 years ago

carl-camilleri-uom commented 3 years ago

Version Information
Akka.NET version: 1.4.23
Akka.NET modules: Akka.Cluster.Sharding

Describe the performance issue

A minimal repro repository for the issue is at https://github.com/carlcamilleri/benchmark-akka-cluster

Running two n2-standard-8 nodes (8 CPU @ 2.80 GHz and 32GB RAM) with Windows Server 2019 in GCP ("instance-1" and "instance-2"), plus a third machine to run the benchmarks from.

First check: curl http://instance-1:5000/5
Response: akka://ping-pong-cluster-system/system/sharding/PingPongActor/13/5(pid=2372,hostname=instance-2)

Therefore the actor for entity id 5 is hosted on the instance-2 server.

wrk -t48 -c400 -d30s http://instance-2:5000/5

Running 30s test @ http://instance-2:5000/5
48 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 19.54ms 70.10ms 1.09s 94.22%
Req/Sec 2.41k 429.13 12.12k 86.40%
3360471 requests in 30.10s, 666.60MB read
Socket errors: connect 0, read 0, write 202, timeout 0
Requests/sec: 111645.29
Transfer/sec: 22.15MB

wrk -t48 -c400 -d30s http://instance-1:5000/5


Running 30s test @ http://instance-1:5000/5
48 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 254.21ms 286.27ms 1.02s 78.54%
Req/Sec 119.09 127.71 0.94k 85.22%
73325 requests in 30.03s, 14.55MB read
Socket errors: connect 0, read 0, write 216, timeout 0
Requests/sec: 2441.41
Transfer/sec: 495.91KB

Out of interest, I've also repeated test (1) (i.e. the workload against the endpoint that requests the local actor) but with serialize-messages = on, and the result is:

Running 30s test @ http://instance-2:5000/5
48 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 24.77ms 90.16ms 1.15s 94.33%
Req/Sec 1.89k 339.47 4.94k 90.42%
2579946 requests in 30.08s, 511.77MB read
Socket errors: connect 0, read 0, write 246, timeout 0
Requests/sec: 85760.95
Transfer/sec: 17.01MB

So Hyperion serialisation drops throughput from ~111k req/s to ~86k req/s, which is probably expected.
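
For reference, the switch in question is the standard akka.actor.serialize-messages HOCON setting; a minimal sketch of enabling it in code is below (the repro may well set it in its HOCON file instead - the system name is the one visible in the actor paths above).

using Akka.Actor;
using Akka.Configuration;

// Forces even local messages through the full serialization pipeline,
// which is what produces the ~111k -> ~86k req/s drop measured above.
var config = ConfigurationFactory.ParseString("akka.actor.serialize-messages = on");
var system = ActorSystem.Create("ping-pong-cluster-system", config);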

Data and Specs

ASKing a local actor I get >111k req/s throughput, but ASKing a remote actor drops throughput to 2.4k req/s.

Expected behavior: Cross-machine communication in Cluster Sharding is expected to be faster.

Actual behavior: Cross-machine communication in Cluster Sharding seems to be extremely slow and unusable for my use case (an OLTP workload).

Environment: .NET 5.0, Windows Server 2019, n2-standard-8 machine in GCP (8 CPU @ 2.80 GHz and 32GB RAM)

Aaronontheweb commented 3 years ago

ASKing a local actor I get >111k req/s throughput, but ASKing a remote actor drops throughput to 2.4k req/s.

The local actor - is this actor hosted by the sharding system, or is it just a local actor floating in-memory?

Aaronontheweb commented 3 years ago

Need to differentiate where the bottleneck is:

If it's the remoting system, there are things you can do to adjust, but ultimately we have to replace our transport system (it's the lion's share of the v1.5 roadmap)

If it's the sharding system, we can patch that too - I opened a similar issue on the entity spawning side last week: https://github.com/akkadotnet/akka.net/issues/5190

Can you give us some color here to inform these numbers? We'll take a look at the source too.

carl-camilleri-uom commented 3 years ago

Thanks, in reply to:

ASKing a local actor I get >111k req/s throughput, but ASKing a remote actor drops throughput to 2.4k req/s.

The local actor - is this actor hosted by the sharding system or is just a local actor floating in-memory?

Correct, it's hosted by the sharding system; in both cases the REST endpoint performs this call:

var resPong = await _shardRegion.Ask<MsgPong>(new ShardEnvelope<MsgPing>(entityId, new MsgPing()));

So the "local" test is actually a call to the REST endpoint hosted on the machine where the sharding system has spawned the actor.

Happy to provide further details or tests that can help.

Thanks

Aaronontheweb commented 3 years ago

My working theory on this is, assuming you're getting both numbers from the same sample:

  1. The hops used to route messages to remote Shards through the ShardRegion introduce additional routing latency - especially since you're not pipelining in that code sample;
  2. The actual ShardRegion actor and Shard actors themselves, functioning as routers, are doing so inefficiently and introduce latency in both local + remote scenarios - it's more visible in the latter because of the additional routing that has to occur;
  3. There is going to be inherent latency in the Akka.Remote I/O pipeline, which is unavoidable, but shouldn't be anywhere close to this severe at these levels.
Aaronontheweb commented 3 years ago

For running this test against a remote sharded actor in your scenario, do you just run two processes and target an entity id not hosted on the HTTP target node?

carl-camilleri-uom commented 3 years ago

Correct, basically the setup is as follows:
instance-1: Windows VM running an instance of the code in the repo
instance-2: Windows VM running an instance of the code in the repo
benchmark-machine: makes calls to either instance-1 or instance-2

From benchmark-machine, curl http://instance-1:5000/5 returns akka://ping-pong-cluster-system/system/sharding/PingPongActor/13/5(pid=2372,hostname=instance-2), so we know that in this case entity id = 5 is hosted by the sharding system on the instance-2 machine.

So hitting the endpoint http://instance-2:5000/5 constitutes a call to the local actor, whereas http://instance-1:5000/5 constitutes a call to the remote actor

Aaronontheweb commented 3 years ago

Thanks! I'll take a look at this and at the very least, resolve https://github.com/akkadotnet/akka.net/issues/3083

Aaronontheweb commented 3 years ago

We've been able to reproduce the 2.4k msg/s figure exactly in the benchmarks we created on #5209

I'm working on using Phobos to do some end-to-end tracing on the shard routing system to see where the most time is piling up. Did a read of the code last night and nothing obvious jumps out at me, but there is a drastic difference in remote vs. local shard performance that is clearly visible and consistent across different hardware profiles.

Aaronontheweb commented 3 years ago

First time a shard + entity is allocated, Phobos graph:

image

Aaronontheweb commented 3 years ago

End to end trace of an already-created entity actor from a remote host:

image

Aaronontheweb commented 3 years ago

A local shard routed version of that chart, for comparison:

image

image

Aaronontheweb commented 3 years ago

Another example of slow message handling inside the shard / shard region actors going across nodes:

image

Aaronontheweb commented 3 years ago

I think I know where to look now - it looks like there's a combination of:

  1. Dispatch / context switching overhead - the gaps between spans on these charts, some of which will be Akka.Remote invocations. This could also be caused by buffering inside the ShardRegion but that won't show up explicitly in the traces here.
  2. Long-running message processing operations - some of those runtimes are going to be skewed by Phobos itself since there's some overhead during tracing but it shouldn't be much.

I think I have enough data to go off of - I'll get started on improving the figures in our benchmark.

carl-camilleri-uom commented 3 years ago

Having done some checks, I believe the performance bottleneck comes from the amount of boxing/unboxing performed by the ShardMessageExtractor.

The attached ZIP (patches.zip) contains:
patch_without-unboxing.diff - a patch that avoids unboxing in an extremely hacky way
patch_with-unboxing.diff - a patch that is the same as patch_without-unboxing.diff but simply enables boxing/unboxing

Without unboxing, the benchmark SingleRequestResponseToRemoteEntity results in ~350ms/op:

image

Simply adding boxing/unboxing drops perf to ~4s/op: image

In comparison, on the same machine, the benchmark SingleRequestResponseToLocalEntity is also at ~350ms/op: image

Possibly one approach could be a generic version of IMessageExtractor that avoids the boxing/unboxing overhead in the custom ShardMessageExtractor implementation? I believe this is the way it is done on the JVM.
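
To make the idea concrete, a rough sketch of what such a generic extractor could look like is below. This interface does not exist in Akka.NET today - it only illustrates a typed alternative to the object-based IMessageExtractor - and the ShardEnvelope<MsgPing>/MsgPing types are the ones from the repro, assumed to expose EntityId and Message properties.

// Hypothetical generic counterpart of IMessageExtractor (not an Akka.NET API).
public interface IMessageExtractor<TEnvelope, TMessage>
{
    string EntityId(TEnvelope envelope);
    TMessage EntityMessage(TEnvelope envelope);
    string ShardId(TEnvelope envelope);
}

// Sketch of an implementation for the repro's envelope type.
public sealed class PingPongMessageExtractor : IMessageExtractor<ShardEnvelope<MsgPing>, MsgPing>
{
    private readonly int _maxShards;
    public PingPongMessageExtractor(int maxShards) => _maxShards = maxShards;

    public string EntityId(ShardEnvelope<MsgPing> envelope) => envelope.EntityId;
    public MsgPing EntityMessage(ShardEnvelope<MsgPing> envelope) => envelope.Message;
    public string ShardId(ShardEnvelope<MsgPing> envelope) =>
        (int.Parse(envelope.EntityId) % _maxShards).ToString();
}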

Thanks

Aaronontheweb commented 3 years ago

@carl-camilleri-uom that's great work. Explains why this issue is unique to sharding. I also noticed we're doing some things like converting the IMessageExtractor into a delegate at startup inside the ShardRegion, which likely does not help either.

Adding a generic IMessageExtractor should be doable, but we'll still need to clean up the delegate mess inside ShardRegion. I'll see if I can deal with the former first.

We can also add a benchmark for the HashCodeMessageExtractor if there isn't one already.

Aaronontheweb commented 3 years ago

The changes you included in your patch file - I can't get them to run.

System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation.
 ---> System.ArgumentNullException: The message cannot be null. (Parameter 'message')
   at Akka.Actor.Envelope..ctor(Object message, IActorRef sender, ActorSystem system) in D:\Repositories\olympus\akka.net\src\core\Akka\Actor\Message.cs:line 28
   at Akka.Actor.ActorCell.SendMessage(IActorRef sender, Object message) in D:\Repositories\olympus\akka.net\src\core\Akka\Actor\ActorCell.cs:line 418
   at Akka.Actor.Futures.Ask[T](ICanTell self, Func`2 messageFactory, Nullable`1 timeout, CancellationToken cancellationToken) in D:\Repositories\olympus\akka.net\src\core\Akka\Actor\Futures.cs:line 143
   at Akka.Actor.Futures.Ask[T](ICanTell self, Object message, Nullable`1 timeout, CancellationToken cancellationToken) in D:\Repositories\olympus\akka.net\src\core\Akka\Actor\Futures.cs:line 105
   at Akka.Cluster.Benchmarks.Sharding.ShardMessageRoutingBenchmarks.SingleRequestResponseToRemoteEntity() in D:\Repositories\olympus\akka.net\src\benchmark\Akka.Cluster.Benchmarks\Sharding\ShardMessageRoutingBenchmarks.cs:line 112
   at BenchmarkDotNet.Autogenerated.Runnable_1.__Workload()
   at BenchmarkDotNet.Autogenerated.Runnable_1.WorkloadActionUnroll(Int64 invokeCount)
   at BenchmarkDotNet.Engines.Engine.RunIteration(IterationData data)
   at BenchmarkDotNet.Engines.EngineFactory.Jit(Engine engine, Int32 jitIndex, Int32 invokeCount, Int32 unrollFactor)
   at BenchmarkDotNet.Engines.EngineFactory.CreateReadyToRun(EngineParameters engineParameters)
   at BenchmarkDotNet.Autogenerated.Runnable_1.Run(BenchmarkCase benchmarkCase, IHost host)
   --- End of inner exception stack trace ---
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Object[] arguments, Signature sig, Boolean constructor, Boolean wrapExceptions)
   at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
   at System.Reflection.MethodBase.Invoke(Object obj, Object[] parameters)
   at BenchmarkDotNet.Toolchains.InProcess.Emit.Implementation.RunnableProgram.Run(BenchmarkId benchmarkId, Assembly partitionAssembly, BenchmarkCase benchmarkCase, IHost host)
ExitCode != 0 and no results reported
No more Benchmark runs will be launched as NO measurements were obtained from the previous run!

Could you send them as a PR instead?

Aaronontheweb commented 3 years ago

So this is really an Akka.Remote performance issue. I replicated the actor hierarchy in its essence here: https://github.com/Aaronontheweb/RemotingBenchmark


BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19041.1165 (2004/May2020Update/20H1)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=5.0.302
  [Host]     : .NET Core 3.1.17 (CoreCLR 4.700.21.31506, CoreFX 4.700.21.31502), X64 RyuJIT
  DefaultJob : .NET Core 3.1.17 (CoreCLR 4.700.21.31506, CoreFX 4.700.21.31502), X64 RyuJIT
| Method | MsgCount | Mean | Error | StdDev |
|------- |--------- |-----:|------:|-------:|
| SingleRequestResponseToLocalEntity | 10000 | 3.575 s | 0.0681 s | 0.0569 s |

We can shave off maybe 25-30% of the performance overhead by improving the sharding system's efficiency of message handling, but it still comes down to Akka.Remote.

What's interesting here is that the benchmark you designed is an absolute worst-case performance scenario for Akka.Remote, because:

  1. Using temporary actors on all of the Ask<T> operations eliminates our ability to cache deserialized IActorRefs on both sides of the connection;
  2. await-ing each call eliminates our ability to batch in Akka.Remote and in all of the actors' mailboxes; and
  3. await-ing each request / response pair blocks flow for the entire stream, meaning one request can't start before the previous one completes.

Factor 3 is the most expensive to overcome - for the sake of comparison, if I change

[Benchmark]
public async Task SingleRequestResponseToRemoteEntity()
{
    for (var i = 0; i < MsgCount; i++)
        await _sys2Remote.Ask<ShardedMessage>(_messageToSys2);
}

To

[Benchmark]
public async Task SingleRequestResponseToRemoteEntity()
{
    var tasks = new List<Task>();
    for (var i = 0; i < MsgCount; i++)
        tasks.Add(_sys2Remote.Ask<ShardedMessage>(_messageToSys2));
    await Task.WhenAll(tasks);
}

The performance profile changes to:


BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19041.1165 (2004/May2020Update/20H1)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=5.0.302
  [Host]     : .NET Core 3.1.17 (CoreCLR 4.700.21.31506, CoreFX 4.700.21.31502), X64 RyuJIT
  DefaultJob : .NET Core 3.1.17 (CoreCLR 4.700.21.31506, CoreFX 4.700.21.31502), X64 RyuJIT
| Method | MsgCount | Mean | Error | StdDev |
|------- |--------- |-----:|------:|-------:|
| SingleRequestResponseToLocalEntity | 10000 | 637.2 ms | 8.48 ms | 7.51 ms |

That looks more like it to me - all of the tasks were completed in the same order, but they were just all allowed to start at the same time. If I change this to use Tell we get this entire workload done in about 110ms using Sharding, which is about 90k msg/s (sounds about right).
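
For illustration, a minimal sketch of that Tell-based shape is below - it is not the actual benchmark code; the driver actor, the completion bookkeeping, and the assumption that the entity replies with one ShardedMessage per request are all illustrative. Instead of awaiting each reply, everything is fired up front and a TaskCompletionSource completes once the last reply arrives, which leaves Akka.Remote free to batch writes.

using System.Threading.Tasks;
using Akka.Actor;

public sealed class TellDriver : ReceiveActor
{
    public TellDriver(IActorRef shardRegion, object message, int msgCount, TaskCompletionSource<bool> done)
    {
        var remaining = msgCount;

        // One reply per request; complete the task when the last reply arrives.
        Receive<ShardedMessage>(_ =>
        {
            if (--remaining == 0)
                done.TrySetResult(true);
        });

        // Fire-and-forget: no per-request await, so all requests stay in flight.
        for (var i = 0; i < msgCount; i++)
            shardRegion.Tell(message, Self);
    }
}

The benchmark body then just awaits done.Task instead of awaiting each Ask<T>.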


Here's what I don't understand about your benchmark @carl-camilleri-uom - you ran this:

wrk -t48 -c400 -d30s http://instance-1:5000/5

And got

Running 30s test @ http://instance-1:5000/5
48 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 254.21ms 286.27ms 1.02s 78.54%
Req/Sec 119.09 127.71 0.94k 85.22%
73325 requests in 30.03s, 14.55MB read
Socket errors: connect 0, read 0, write 216, timeout 0
Requests/sec: 2441.41
Transfer/sec: 495.91KB

Is wrk forcing all of its HTTP requests to that endpoint to run in sequence or in parallel? Because by the looks of your numbers they line up exactly with what would happen if the requests were run only in sequence. In a real world scenario with multiple concurrent requests for the same entity arriving all at once you'd expect results similar to my second set of benchmark numbers. What can account for the difference there?

carl-camilleri-uom commented 3 years ago

@Aaronontheweb thanks for this information.

First of all, apologies for the initial indication that boxing/unboxing was causing the issue - it was a red herring; indeed, with your benchmark of HashCodeMessageExtractor I'm not able to replicate any latency.

Secondly, thanks also for the details provided.

With regards to the workload submitted by wrk, I believe this is parallelised by the number of connections according to the specs at https://github.com/wg/wrk.

Thus wrk -t48 -c400 -d30s http://instance-2:5000/5 should be executing 400 concurrent connections allocated over 48 threads on the benchmark machine.

I have also tried tests using wrk2 (https://github.com/giltene/wrk2) with the same results.

I have been testing some further approaches, and the repo at https://github.com/carlcamilleri/benchmark-akka-cluster has been updated as follows:

  1. The same application exposes a new endpoint /batch/{entityId} that sends 100k messages to the entity actor and awaits them all (a rough sketch of its shape follows after this list). A single request to this endpoint targeting a remote entity completes in an average of 2.56s, which works out at ~40k req/s and is in line with your benchmarks when awaiting a set of Tasks: image

  2. The endpoint /{entityId} now takes another approach whereby the request received on the API is sent to a local Actor Router, which then handles the request to the cluster without explicitly await-ing, although of course the web server still waits for the response from the actor to complete the HTTP request. This still reproduces the issue: I don't get more than 3k req/s

image
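
For reference, the /batch/{entityId} handler is roughly of this shape (illustrative only, not the exact code in the repo; it reuses the ShardEnvelope<MsgPing>/MsgPong types from earlier in this thread):

var tasks = new List<Task<MsgPong>>(100_000);
for (var i = 0; i < 100_000; i++)
    tasks.Add(_shardRegion.Ask<MsgPong>(new ShardEnvelope<MsgPing>(entityId, new MsgPing())));
await Task.WhenAll(tasks);

Because all of those Asks are in flight at once, Akka.Remote can presumably batch them, which would explain why this path reaches ~40k req/s while the one-await-per-request path does not.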

My question therefore is whether there is perhaps a better way to approach this problem. Basically, the problem at hand could be described simply as the need to implement a REST API that returns the details of a business entity cached within a sharded cluster.

The API consumers are of course independent and the API is synchronous, i.e. the API consumer needs to wait for the response from the API. If I understand correctly, this is what is introducing the problem (at least with my approach), i.e. the different (parallel, mutually independent) requests on the API are being handled serially on the cluster.

Thank you

carl-camilleri-uom commented 3 years ago

@to11mtm @Aaronontheweb thanks for the analysis in #5230. Just to confirm, do we expect this to improve performance even in the case of the following approach?

image

akka_behaviours-Page-2

For reference, I have benchmarked this scenario (https://github.com/carlcamilleri/benchmark-akka-cluster/blob/master/Startup.cs), which is still performing poorly.

Thanks

to11mtm commented 3 years ago

@to11mtm @Aaronontheweb thanks for the analysis in #5230. Just to confirm, do we expect this to improve performance even in the case of the following approach?

image

akka_behaviours-Page-2

For reference, I have benchmarked this scenario (https://github.com/carlcamilleri/benchmark-akka-cluster/blob/master/Startup.cs), which is still performing poorly.

Thanks

#5320 would theoretically improve performance issues around remote asks. I don't think they would help in the case of the scenario in your benchmark.

Looking at said benchmark, however, I would suggest:

carl-camilleri-uom commented 3 years ago

@to11mtm thanks, I have added a small benchmark as part of ShardMessageRoutingBenchmarks to test this approach with a RoundRobinPool of 50, and even increased it to 1000.

The results are below: I still get >3s to complete 10,000 messages, so it seems that not using Ask on the remote path does not improve much on the results from SingleRequestResponseToRemoteEntity.

Thanks


BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1165 (21H1/May2021Update)
Intel Core i7-10850H CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=5.0.206
  [Host]     : .NET Core 3.1.18 (CoreCLR 4.700.21.35901, CoreFX 4.700.21.36305), X64 RyuJIT
  Job-IUOUID : .NET Core 3.1.18 (CoreCLR 4.700.21.35901, CoreFX 4.700.21.36305), X64 RyuJIT

InvocationCount=1  UnrollFactor=1  
| Method | StateMode | MsgCount | Mean | Error | StdDev |
|------- |---------- |--------- |-----:|------:|-------:|
| SingleRequestResponseToRemoteEntityWithLocalProxy | Persistence | 10000 | 3.113 s | 0.0616 s | 0.0710 s |
| SingleRequestResponseToRemoteEntityWithLocalProxy | DData | 10000 | 3.126 s | 0.0417 s | 0.0390 s |
Aaronontheweb commented 3 years ago

The results are below: I still get >3s to complete 10,000 messages, so it seems that not using Ask on the remote path does not improve much on the results from SingleRequestResponseToRemoteEntity.

Your performance issue here is primarily flow control:

[Benchmark]
public async Task SingleRequestResponseToRemoteEntityWithLocalProxy()
{
    for (var i = 0; i < MsgCount; i++)
        await _localRouter.Ask<ShardedMessage>(new SendShardedMessage(_messageToSys2.EntityId, _messageToSys2));
}

It's the await that is killing you, as I pointed out here: https://github.com/akkadotnet/akka.net/issues/5203#issuecomment-902828065

I think there is something wrong with your HTTP benchmark, as you should be seeing throughput numbers similar to the case where there are lots of concurrent Ask calls - performance more on the order of hundreds of ms, not thousands, for 10k messages. That what you're seeing lines up nearly identically with our serial Ask<T> numbers makes me suspicious. I wonder if it's a synchronization context issue?

carl-camilleri-uom commented 3 years ago

@Aaronontheweb I understand it's not ideal to await, however I'm not sure I understand what the correct approach should be for the use case at hand (https://github.com/akkadotnet/akka.net/issues/5203#issuecomment-903241323)?

In principle I am using the approach described at https://petabridge.com/blog/akkadotnet-aspnet/ for Request/Response. Perhaps there's a better approach?

Furthermore, using the same approach for a local entity gives acceptable throughput even with await:

public async Task SingleRequestResponseToLocalEntity()
{
    for (var i = 0; i < MsgCount; i++)
        await _shardRegion1.Ask<ShardedMessage>(_messageToSys1);
}

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1165 (21H1/May2021Update)
Intel Core i7-10850H CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=5.0.206
  [Host]     : .NET Core 3.1.18 (CoreCLR 4.700.21.35901, CoreFX 4.700.21.36305), X64 RyuJIT
  Job-WQMFMH : .NET Core 3.1.18 (CoreCLR 4.700.21.35901, CoreFX 4.700.21.36305), X64 RyuJIT

InvocationCount=1  UnrollFactor=1  
| Method | StateMode | MsgCount | Mean | Error | StdDev |
|------- |---------- |--------- |-----:|------:|-------:|
| SingleRequestResponseToLocalEntity | Persistence | 10000 | 63.94 ms | 1.260 ms | 2.139 ms |
| SingleRequestResponseToLocalEntity | DData | 10000 | 65.08 ms | 1.233 ms | 2.254 ms |

With SingleRequestResponseToRemoteEntityWithLocalProxy I have tried to come up with an approach that does not rely on an Ask for the remote operation, but it seems that it doesn't solve the problem.

Thank you

Aaronontheweb commented 3 years ago

I understand it's not ideal to await however I'm not sure I understand what should be the correct approach for the use case at hand

Your original benchmark code was just fine for implementing request-response - but when many concurrent HTTP clients are all hitting that endpoint, all of their Ask<T>s run in parallel. That's not the case with our benchmarks, where we're getting 3 seconds per 10,000 messages - all of those Ask<T>s wait in sequence. That's the problem: concurrent vs. sequential. It has nothing to do with Ask<T> - it's whether many requests can be in-flight at once or whether they get serialized into a 1-by-1-by-1 sequence. Does that make sense - that what you're seeing is the combination of real I/O (the difference between 5ms and 500ms in our local vs. remote Tell benchmarks) and flow control (the difference between 500ms and ~3s in Tell vs. Ask)? And that the issue is that, the way those Ask benchmarks are structured, one request can't begin until another completes?

What's weird is that your benchmark was designed to be concurrent but its performance numbers match the sequential access benchmarks - that's what I don't understand.

Furthermore, using the same approach for a local entity gives acceptable throughput even with await :

That's all happening in-memory and there's no real I/O involved, so the numbers will be good. We know that Akka.Remote definitely has some performance constraints, since it's all running over a single TCP connection per node, and we have plans to fix that, but it's a bigger project.

Aaronontheweb commented 3 years ago

Since writing my last comment we've improved the performance of Akka.Remote by ~40%: https://github.com/akkadotnet/akka.net/pull/5247

However, the improvement in the Akka.Cluster.Sharding benchmark is only about ~15% - the thing that really moved the needle was our improvements around IActorRef resolution in the deserialization pipeline, which has a high impact on the temporary actors that power Ask<T> and thus can't be cached on either read-side (receiver in, sender in). That also helped speed up child lookups inside the ShardRegion, but the effects of that are pretty small outside of a micro-benchmark.

As to why this benchmark isn't benefitting linearly from the Akka.Remote improvements, it likely has to do with not being able to take advantage of I/O batching or caching on either side due to how the temporary actors work and how each request --> response pair has to complete synchronously in this benchmark as it's written. I'd like to run @carl-camilleri-uom 's original HTTP benchmark first and get some more data from that, since that benchmark should benefit from I/O batching but that's not apparent from his numbers.

I think there's more that could be done to speed up some individual operations happening on the read-side in particular, since we're pretty certain that's where the bottleneck is in our system, but that's going to take some more trial-and-error to find.

I'm also going to look into speeding up the operations of the sharding actors themselves - we know they're wasteful: https://github.com/akkadotnet/akka.net/pull/5217

carl-camilleri-uom commented 3 years ago

Hi,

Just in case it is of interest, I have in the meantime performed similar benchmarks in the JVM world.

I have reported my findings at https://github.com/akka/akka/issues/30546.

Throughput is higher on Akka (JVM), but I still replicate a massive drop in throughput when comparing calls to local actors vs. calls to a remote actor in the cluster. In one particular test, the drop went from >120k req/s to <15k req/s.

One very empirical observation that I get from benchmarking both Akka.Net and Akka (JVM) is that I am never able to saturate the CPU of the cluster machines when calling a remote entity, whereas the CPU is easily saturated when calling a local entity in the cluster.

This behaviour does not seem to be related to network I/O - out of interest I have tried setting up a gRPC endpoint and testing it under the same conditions, i.e. the same benchmark, but making a call to the remote gRPC endpoint instead of a call to the entity shard. In this case, the CPU on both machines is saturated and throughput remains approximately the same as when a local actor is called.

At this stage I'm starting to wonder whether this is due to a single actor being responsible for marshalling calls on a per-shard basis, meaning any request to the same shard has to go through a single actor.

This may be a stupid idea as I do not have full knowledge of the internals of how Akka clustering works, so apologies in advance!

Maybe I'm also hitting a bit of an Achilles heel, but I think my use case is a very common one: an Akka(.NET) cluster serving as a distributed cache of objects with non-uniform traffic, i.e. some of those objects (or indeed, shards) are responsible for the majority of the traffic.

Thanks

Aaronontheweb commented 3 years ago

At this stage I'm starting to wonder whether this is due to a single actor being responsible for marshalling calls on a per-shard basis, meaning any request to the same shard has to go through a single actor.

This may be a stupid idea as I do not have full knowledge of the internals of how Akka clustering works, so apologies in advance!

It's definitely not stupid! Since you reported your issue we've just about doubled the throughput of Akka.Remote, but that hasn't translated into a corresponding improvement in shard performance, which indicates exactly what you suspect: a bottleneck somewhere inside the sharding system.

I think it's very likely the ShardRegion --> ShardRegion --> Shard messaging path that is acting as the bottleneck in this case. In the local scenario the messages only have to follow a ShardRegion --> Shard --> Entity path, but that latency might well be paved over by the fact that there's no real I/O at play.

Ultimately there still needs to be a system for guaranteeing that there's exactly one instance of each entity actor, which is why this exists - but based on some work I and others have done, I think we can make the existing implementation more efficient. Redesigning sharding to remove the bottleneck is a tall order, but reducing the head-of-line blocking inside the Shard actors is achievable.

to11mtm commented 3 years ago

@Aaronontheweb some notes:

Regarding overhead of sharded actors/etc:

By just forwarding, this will give us some insight as to the overhead of forwarding between actors as opposed to what we are doing in those actors themselves.

Regarding Remote Asks:

but that latency might well be paved over with the fact that there's no real I/O at play.

One big difference between local with serialize-messages and remote is that the protobuf serialization/deserialization and remote actorref resolve stages are elided.

Having stared at a lot more profiler trees than I'd like to admit, I can say the following:

Aaronontheweb commented 3 years ago

If we were able to spec serializer IDs as having a max value, we probably could...

The literature on those is a max value of 1024 with 0-99 reserved for internal Akka.NET use.

Aaronontheweb commented 3 years ago

I really, really, really wish we could get rid of DotNetty. Or at least update it.

That makes two of us. The new version of DotNetty does replace some of its threading internals but the results were gruesome. I want to push on the gas pedal for Akka.NET v1.5 and Artery but my docket is full of other, mostly Petabridge-related issues at the moment.

to11mtm commented 3 years ago

The literature on those is a max value of 1024 with 0-99 reserved for internal Akka.NET use.

So, what we could do here is allocate an array of 1,280 Serializers up front. We do this because I am 98% sure we have some internal serializers defined with negative IDs, so everything in the array is shifted by +256. This makes Serializer lookups by ID a simple-ish array index instead of a dictionary lookup. It will increase memory usage by 5/10 KB (32/64-bit), but it will be a fixed cost. Bonus: this will also help scenarios in Persistence and elsewhere where we look up the Serializer by ID.
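
A minimal sketch of that idea (class and method names are illustrative, not actual Akka.NET internals; it assumes serializer IDs fall within the -256..1023 range implied above):

using System;
using Akka.Serialization;

public sealed class SerializerIdTable
{
    private const int Offset = 256; // room for internal serializers with negative IDs

    // 256 slots for negative IDs + 1024 slots for IDs 0..1023 = 1,280 entries.
    private readonly Serializer[] _byId = new Serializer[256 + 1024];

    public void Register(Serializer serializer) =>
        _byId[serializer.Identifier + Offset] = serializer;

    // Hot path: a plain array index instead of a Dictionary<int, Serializer> lookup.
    public Serializer Lookup(int serializerId) =>
        _byId[serializerId + Offset]
        ?? throw new ArgumentException($"Unknown serializer id [{serializerId}]", nameof(serializerId));
}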

I really, really, really wish we could get rid of DotNetty. Or at least update it.

That makes two of us. The new version of DotNetty does replace some of its threading internals but the results were gruesome. I want to push on the gas pedal for Akka.NET v1.5 and Artery but my docket is full of other, mostly Petabridge-related issues at the moment.

Didn't see this before, so I'll say it here; maybe SpanNetty would be worth a look from a low-effort gain standpoint. The other low-effort items would be ForkJoin delegate removal, UnsafeWrap in Remote, Serializer IDs, and doing some EqualityComparer defs on various collections.

Aaronontheweb commented 2 years ago

Measuring Sharding performance with some of the serialization changes that will be introduced in v1.4.30 - these are mostly memory / allocation optimizations so the impact on throughput should be minor.

1.4.28

// Summary

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19041.1348 (2004/May2020Update/20H1)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.100
[Host] : .NET Core 3.1.17 (CoreCLR 4.700.21.31506, CoreFX 4.700.21.31502), X64 RyuJIT
Job-TMZJWF : .NET Core 3.1.17 (CoreCLR 4.700.21.31506, CoreFX 4.700.21.31502), X64 RyuJIT

InvocationCount=1 UnrollFactor=1

| Method | StateMode | MsgCount | Mean | Error | StdDev |
|------- |---------- |--------- |-----:|------:|-------:|
| SingleRequestResponseToLocalEntity | Persistence | 10000 | 112.852 ms | 2.2494 ms | 3.5020 ms |
| StreamingToLocalEntity | Persistence | 10000 | 4.797 ms | 0.2881 ms | 0.8267 ms |
| SingleRequestResponseToRemoteEntity | Persistence | 10000 | 3,532.257 ms | 11.9092 ms | 10.5572 ms |
| SingleRequestResponseToRemoteEntityWithLocalProxy | Persistence | 10000 | 4,060.152 ms | 7.0778 ms | 5.9103 ms |
| StreamingToRemoteEntity | Persistence | 10000 | 412.605 ms | 5.3912 ms | 5.0429 ms |
| SingleRequestResponseToLocalEntity | DData | 10000 | 108.035 ms | 2.1032 ms | 2.5037 ms |
| StreamingToLocalEntity | DData | 10000 | 4.682 ms | 0.2615 ms | 0.7420 ms |
| SingleRequestResponseToRemoteEntity | DData | 10000 | 3,559.191 ms | 11.0759 ms | 10.3604 ms |
| SingleRequestResponseToRemoteEntityWithLocalProxy | DData | 10000 | 4,057.514 ms | 8.6033 ms | 7.6266 ms |
| StreamingToRemoteEntity | DData | 10000 | 407.138 ms | 3.0236 ms | 2.8283 ms |

1.4.30

// Summary

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19041.1348 (2004/May2020Update/20H1)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.100
[Host] : .NET Core 3.1.17 (CoreCLR 4.700.21.31506, CoreFX 4.700.21.31502), X64 RyuJIT
Job-SWGOFT : .NET Core 3.1.17 (CoreCLR 4.700.21.31506, CoreFX 4.700.21.31502), X64 RyuJIT

InvocationCount=1 UnrollFactor=1

| Method | StateMode | MsgCount | Mean | Error | StdDev |
|------- |---------- |--------- |-----:|------:|-------:|
| SingleRequestResponseToLocalEntity | Persistence | 10000 | 112.381 ms | 2.2262 ms | 2.8155 ms |
| StreamingToLocalEntity | Persistence | 10000 | 5.058 ms | 0.2931 ms | 0.8504 ms |
| SingleRequestResponseToRemoteEntity | Persistence | 10000 | 3,518.615 ms | 10.0682 ms | 8.9252 ms |
| SingleRequestResponseToRemoteEntityWithLocalProxy | Persistence | 10000 | 4,024.819 ms | 7.6202 ms | 6.7551 ms |
| StreamingToRemoteEntity | Persistence | 10000 | 405.815 ms | 3.9392 ms | 3.6847 ms |
| SingleRequestResponseToLocalEntity | DData | 10000 | 109.762 ms | 2.1667 ms | 3.3088 ms |
| StreamingToLocalEntity | DData | 10000 | 4.806 ms | 0.2646 ms | 0.7550 ms |
| SingleRequestResponseToRemoteEntity | DData | 10000 | 3,587.875 ms | 15.3351 ms | 13.5941 ms |
| SingleRequestResponseToRemoteEntityWithLocalProxy | DData | 10000 | 4,065.668 ms | 8.6141 ms | 7.6362 ms |
| StreamingToRemoteEntity | DData | 10000 | 402.225 ms | 4.5253 ms | 4.2329 ms |
4.2329 ms