Open carl-camilleri-uom opened 3 years ago
ASKing a local actor I get >111k req/s throughput, but ASKing a remote actor drops throughput to 2.4k req/s.
The local actor - is this actor hosted by the sharding system, or is it just a local actor floating in-memory?
Need to differentiate where the bottleneck is:
If it's the remoting system, there are things you can do to adjust, but ultimately we have to replace our transport system (that's the lion's share of the v1.5 roadmap)
If it's the sharding system we can patch that too - opened a similar issue on the entity spawning side last week: https://github.com/akkadotnet/akka.net/issues/5190
Can you give us some color here to inform these numbers? We'll take a look at the source too.
Thanks, in reply to:
ASKing a local actor I get >111k req/s throughput, but ASKing a remote actor drops throughput to 2.4k req/s.
The local actor - is this actor hosted by the sharding system, or is it just a local actor floating in-memory?
Correct, it's hosted by the sharding system; in both cases the REST endpoint performs this call:
```csharp
var resPong = await _shardRegion.Ask<MsgPong>(new ShardEnvelope<MsgPing>(entityId, new MsgPing()));
```
So the "local" test is actually a call to the REST endpoint hosted on the machine where the sharding system has spawned the actor.
Happy to provide further details or tests that can help.
Thanks
My working theory on this is, assuming you're getting both numbers from the same sample:

- Shards passing through the `ShardRegion` introduce additional routing latency - especially since you're not pipelining in that code sample;
- the `ShardRegion` actor and `Shard` actors themselves, functioning as routers, are doing so in an inefficient manner and introduce latency in both local + remote scenarios, but it's more visible in the latter for reasons related to the additional routing that has to occur.

For running this test against a remote sharded actor in your scenario, do you just run two processes and target an entity id not hosted on the HTTP target node?
Correct, basically the setup is as follows:

- instance-1: Windows VM running an instance of the code in the repo
- instance-2: Windows VM running an instance of the code in the repo
- benchmark-machine: makes calls to either instance-1 or instance-2
From benchmark-machine, `curl http://instance-1:5000/5` returns akka://ping-pong-cluster-system/system/sharding/PingPongActor/13/5(pid=2372,hostname=instance-2), so we know that in this case entity id = 5 is hosted by the sharding system on the instance-2 machine.
So hitting the endpoint http://instance-2:5000/5 constitutes a call to the local actor, whereas http://instance-1:5000/5 constitutes a call to the remote actor
Thanks! I'll take a look at this and at the very least, resolve https://github.com/akkadotnet/akka.net/issues/3083
We've been able to reproduce the 2.4k msg/s figure exactly in the benchmarks we created on #5209
I'm working on using Phobos to do some end-to-end tracing on the shard routing system to see where the most time is piling up. Did a read of the code last night and nothing obvious jumps out at me, but there is a drastic difference in remote vs. local shard performance that is clearly visible and consistent across different hardware profiles.
First time a shard + entity is allocated, Phobos graph:
End to end trace of an already-created entity actor from a remote host:
A local shard routed version of that chart, for comparison:
Another example of slow message handling inside the shard / shard region actors when going across nodes
I think I know where to look now - it looks like there's a combination of factors, including routing overhead inside the `ShardRegion`, but that won't show up explicitly in the traces here.

I think I have enough data to go off of - I'll get started on improving the figures in our benchmark.
Having done some checks, I believe the performance bottleneck is coming from the amount of boxing/unboxing that is performed by the ShardMessageExtractor.

The attached ZIP (patches.zip) contains:

- patch_without-unboxing.diff - a patch that avoids unboxing in an extremely hacky way
- patch_with-unboxing.diff - a patch that is the same as patch_without-unboxing.diff but simply enables boxing/unboxing
Without unboxing, the benchmark `SingleRequestResponseToRemoteEntity` results in ~350ms/op:

Simply adding boxing/unboxing drops perf to ~4s/op:

In comparison, on the same machine, the benchmark `SingleRequestResponseToLocalEntity` is also at ~350ms/op:

Possibly an approach could be a generic version of `IMessageExtractor` that avoids the boxing/unboxing overhead in the custom `ShardMessageExtractor` implementation? I believe this is the way it is done in the JVM.
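For illustration, such a generic extractor might look something like this - purely a hypothetical sketch (the `IMessageExtractor<TEnvelope>` interface and `PingExtractor` class below are not Akka.NET's actual API), reusing the `ShardEnvelope<MsgPing>` type from the repro repo and assuming it exposes `EntityId`/`Message` properties:

```csharp
// Hypothetical sketch only - Akka.NET's real IMessageExtractor is non-generic
// and routes everything through 'object'. A generic variant could keep the
// envelope strongly typed on the hot path:
public interface IMessageExtractor<TEnvelope>
{
    string EntityId(TEnvelope envelope);
    object EntityMessage(TEnvelope envelope);
    string ShardId(TEnvelope envelope);
}

// Strongly-typed extractor for the ShardEnvelope<MsgPing> used in the benchmark:
public sealed class PingExtractor : IMessageExtractor<ShardEnvelope<MsgPing>>
{
    private readonly int _maxShards;

    public PingExtractor(int maxShards) => _maxShards = maxShards;

    public string EntityId(ShardEnvelope<MsgPing> envelope) => envelope.EntityId;
    public object EntityMessage(ShardEnvelope<MsgPing> envelope) => envelope.Message;
    public string ShardId(ShardEnvelope<MsgPing> envelope)
        => (Math.Abs(envelope.EntityId.GetHashCode()) % _maxShards).ToString();
}
```

Because the envelope never passes through `object` on the extraction path, struct payloads avoid the box/unbox per message.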
Thanks
@carl-camilleri-uom that's great work. Explains why this issue is unique to sharding. I also noticed we're doing some things like converting the `IMessageExtractor` into a delegate at startup inside the `ShardRegion`, which likely does not help either.

Adding a generic `IMessageExtractor` should be doable, but we'll still need to clean up the delegate mess inside `ShardRegion`. I'll see if I can deal with the former first.

We can also add a benchmark for the `HashCodeMessageExtractor` if there isn't one already.
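Such a micro-benchmark could be sketched roughly as follows - a hedged sketch only (the subclass and benchmark names are assumptions, not the benchmark that was actually added), reusing the `ShardEnvelope<MsgPing>` type from the repro repo:

```csharp
using Akka.Cluster.Sharding;
using BenchmarkDotNet.Attributes;

// HashCodeMessageExtractor only requires EntityId to be implemented;
// ShardId is derived from the entity id's hash.
public sealed class EnvelopeExtractor : HashCodeMessageExtractor
{
    public EnvelopeExtractor(int maxNumberOfShards) : base(maxNumberOfShards) { }

    public override string EntityId(object message)
        => ((ShardEnvelope<MsgPing>)message).EntityId;
}

[MemoryDiagnoser]
public class MessageExtractorBenchmarks
{
    private readonly EnvelopeExtractor _extractor = new EnvelopeExtractor(100);
    private readonly object _message = new ShardEnvelope<MsgPing>("5", new MsgPing());

    [Benchmark]
    public string ResolveShardId() => _extractor.ShardId(_message);

    [Benchmark]
    public string ResolveEntityId() => _extractor.EntityId(_message);
}
```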
The changes you included in your patch file - I can't get them to run.
```
System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation.
 ---> System.ArgumentNullException: The message cannot be null. (Parameter 'message')
   at Akka.Actor.Envelope..ctor(Object message, IActorRef sender, ActorSystem system) in D:\Repositories\olympus\akka.net\src\core\Akka\Actor\Message.cs:line 28
   at Akka.Actor.ActorCell.SendMessage(IActorRef sender, Object message) in D:\Repositories\olympus\akka.net\src\core\Akka\Actor\ActorCell.cs:line 418
   at Akka.Actor.Futures.Ask[T](ICanTell self, Func`2 messageFactory, Nullable`1 timeout, CancellationToken cancellationToken) in D:\Repositories\olympus\akka.net\src\core\Akka\Actor\Futures.cs:line 143
   at Akka.Actor.Futures.Ask[T](ICanTell self, Object message, Nullable`1 timeout, CancellationToken cancellationToken) in D:\Repositories\olympus\akka.net\src\core\Akka\Actor\Futures.cs:line 105
   at Akka.Cluster.Benchmarks.Sharding.ShardMessageRoutingBenchmarks.SingleRequestResponseToRemoteEntity() in D:\Repositories\olympus\akka.net\src\benchmark\Akka.Cluster.Benchmarks\Sharding\ShardMessageRoutingBenchmarks.cs:line 112
   at BenchmarkDotNet.Autogenerated.Runnable_1.__Workload()
   at BenchmarkDotNet.Autogenerated.Runnable_1.WorkloadActionUnroll(Int64 invokeCount)
   at BenchmarkDotNet.Engines.Engine.RunIteration(IterationData data)
   at BenchmarkDotNet.Engines.EngineFactory.Jit(Engine engine, Int32 jitIndex, Int32 invokeCount, Int32 unrollFactor)
   at BenchmarkDotNet.Engines.EngineFactory.CreateReadyToRun(EngineParameters engineParameters)
   at BenchmarkDotNet.Autogenerated.Runnable_1.Run(BenchmarkCase benchmarkCase, IHost host)
   --- End of inner exception stack trace ---
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Object[] arguments, Signature sig, Boolean constructor, Boolean wrapExceptions)
   at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
   at System.Reflection.MethodBase.Invoke(Object obj, Object[] parameters)
   at BenchmarkDotNet.Toolchains.InProcess.Emit.Implementation.RunnableProgram.Run(BenchmarkId benchmarkId, Assembly partitionAssembly, BenchmarkCase benchmarkCase, IHost host)
ExitCode != 0 and no results reported
No more Benchmark runs will be launched as NO measurements were obtained from the previous run!
```
Could you send them as a PR instead?
So this is really an Akka.Remote performance issue. I replicated the actor hierarchy in its essence here: https://github.com/Aaronontheweb/RemotingBenchmark
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19041.1165 (2004/May2020Update/20H1)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=5.0.302
[Host] : .NET Core 3.1.17 (CoreCLR 4.700.21.31506, CoreFX 4.700.21.31502), X64 RyuJIT
DefaultJob : .NET Core 3.1.17 (CoreCLR 4.700.21.31506, CoreFX 4.700.21.31502), X64 RyuJIT
Method | MsgCount | Mean | Error | StdDev |
---|---|---|---|---|
SingleRequestResponseToLocalEntity | 10000 | 3.575 s | 0.0681 s | 0.0569 s |
We can shave off maybe 25-30% of the performance overhead by improving the efficiency of the sharding system's message handling, but it still comes down to Akka.Remote.
What's interesting here is that the benchmark you designed is an absolute worst-case performance scenario for Akka.Remote. This use case is pretty interesting because:

1. `Ask<T>` operations eliminate our ability to cache deserialized `IActorRef`s on both sides of the connection;
2. `await`-ing each call eliminates our ability to batch in Akka.Remote and in all of the actors' mailboxes; and
3. `await`-ing each request / response pair blocks flow for the entire stream, meaning one request can't start before the previous one completes.

Factor 3 is the most expensive to overcome - for the sake of comparison, if I change
```csharp
[Benchmark]
public async Task SingleRequestResponseToRemoteEntity()
{
    for (var i = 0; i < MsgCount; i++)
        await _sys2Remote.Ask<ShardedMessage>(_messageToSys2);
}
```
To
```csharp
[Benchmark]
public async Task SingleRequestResponseToRemoteEntity()
{
    var tasks = new List<Task>();
    for (var i = 0; i < MsgCount; i++)
        tasks.Add(_sys2Remote.Ask<ShardedMessage>(_messageToSys2));
    await Task.WhenAll(tasks);
}
```
The performance profile changes to:
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19041.1165 (2004/May2020Update/20H1)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=5.0.302
[Host] : .NET Core 3.1.17 (CoreCLR 4.700.21.31506, CoreFX 4.700.21.31502), X64 RyuJIT
DefaultJob : .NET Core 3.1.17 (CoreCLR 4.700.21.31506, CoreFX 4.700.21.31502), X64 RyuJIT
Method | MsgCount | Mean | Error | StdDev |
---|---|---|---|---|
SingleRequestResponseToLocalEntity | 10000 | 637.2 ms | 8.48 ms | 7.51 ms |
That looks more like it to me - all of the tasks were completed in the same order but they were just all allowed to start at the same time. If I change this to use `Tell` we get this entire workload done in about 110ms using Sharding, which is about ~90k msg/s (sounds about right).
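For reference, the `Tell`-based variant might be structured roughly like this - a sketch, not the actual benchmark code (the `ReplyCollector` actor and the `_sys1` local ActorSystem field are assumptions alongside the existing `_sys2Remote` / `_messageToSys2` / `MsgCount` members):

```csharp
// Fire every message with Tell and have a local collector actor complete a
// TaskCompletionSource once every reply has arrived - no per-request await.
public sealed class ReplyCollector : ReceiveActor
{
    private int _remaining;

    public ReplyCollector(int expected, TaskCompletionSource<bool> done)
    {
        _remaining = expected;
        Receive<ShardedMessage>(_ =>
        {
            if (--_remaining == 0)
                done.TrySetResult(true);
        });
    }
}

[Benchmark]
public async Task TellToRemoteEntity()
{
    var done = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
    var collector = _sys1.ActorOf(Props.Create(() => new ReplyCollector(MsgCount, done)));

    for (var i = 0; i < MsgCount; i++)
        _sys2Remote.Tell(_messageToSys2, collector); // replies route back to the collector

    await done.Task;
}
```

Because nothing blocks between sends, Akka.Remote is free to batch the writes, which is where the ~110ms figure comes from.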
Here's what I don't understand about your benchmark @carl-camilleri-uom - you ran this:
```
wrk -t48 -c400 -d30s http://instance-1:5000/5
```
And got
```
Running 30s test @ http://instance-1:5000/5
  48 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   254.21ms  286.27ms   1.02s    78.54%
    Req/Sec   119.09    127.71     0.94k    85.22%
  73325 requests in 30.03s, 14.55MB read
  Socket errors: connect 0, read 0, write 216, timeout 0
Requests/sec:   2441.41
Transfer/sec:    495.91KB
```
Is `wrk` forcing all of its HTTP requests to that endpoint to run in sequence or in parallel? Because by the looks of your numbers, they line up exactly with what would happen if the requests were run only in sequence. In a real-world scenario with multiple concurrent requests for the same entity arriving all at once, you'd expect results similar to my second set of benchmark numbers. What can account for the difference there?
@Aaronontheweb thanks for this information.
First of all, apologies for the initial indication that boxing/unboxing was causing the issue - it was a red herring; indeed, with your benchmark on HashCodeMessageExtractor I'm not able to replicate any latency.
Secondly, thanks also for the details provided.
With regards to the workload submitted by `wrk`, I believe this is parallelised by the number of connections according to the specs at https://github.com/wg/wrk. Thus `wrk -t48 -c400 -d30s http://instance-2:5000/5` should be executing 400 concurrent connections allocated over 48 threads on the benchmark machine.

I have also tried tests using `wrk2` (https://github.com/giltene/wrk2) with the same results.
I have been testing some further approaches, and the repo at https://github.com/carlcamilleri/benchmark-akka-cluster has been updated as follows:

The same application exposes a new endpoint `/batch/{entityId}` which sends 100k messages to the entity actor and awaits them all. A single request to this endpoint requesting a remote entity completes in an average of 2.56s, which works out at ~40k req/s and is in line with your benchmarks when awaiting a set of Tasks:
The endpoint `/{entityId}` now takes another approach whereby the request received on the API is sent to a local Actor Router, which then handles the request to the cluster without explicitly `await`-ing, although of course the web server still waits for the response from the actor to complete the HTTP request. This still reproduces the issue: I don't get more than 3k req/s.
I guess therefore my question is whether there is perhaps a better way to approach this problem? Basically, the problem at hand could be described simply as the need to implement a REST API that returns the details of a business entity cached within a sharded cluster.

The API consumers are of course independent, and the API is synchronous, i.e. the API consumer needs to wait for the response from the API. If I understand correctly, this is what introduces the problem (at least with my approach): the different (parallel, mutually-exclusive) requests on the API are being handled in a serial manner on the cluster.
Thank you
@to11mtm @Aaronontheweb thanks for the analysis in #5230. Just to confirm, do we expect this to improve performance even in the case of the following approach?
For reference I have benchmarked this scenario (https://github.com/carlcamilleri/benchmark-akka-cluster/blob/master/Startup.cs) which is still performing poorly
Thanks
@to11mtm @Aaronontheweb thanks for the analysis in #5230 . Just to confirm do we expect this to improve performance even in the case of the following approach? :
For reference I have benchmarked this scenario (https://github.com/carlcamilleri/benchmark-akka-cluster/blob/master/Startup.cs) which is still performing poorly
Thanks
Looking at said benchmark however, I would suggest:
@to11mtm thanks, I have added a small benchmark as part of `ShardMessageRoutingBenchmarks` to test this approach with a RoundRobinPool of 50, and even increased it to 1000. The results are below: I still get >3s to complete 10000 messages, so it seems not using `Ask` on the remote path does not improve much on the results from `SingleRequestResponseToRemoteEntity`.
Thanks
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1165 (21H1/May2021Update)
Intel Core i7-10850H CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=5.0.206
[Host] : .NET Core 3.1.18 (CoreCLR 4.700.21.35901, CoreFX 4.700.21.36305), X64 RyuJIT
Job-IUOUID : .NET Core 3.1.18 (CoreCLR 4.700.21.35901, CoreFX 4.700.21.36305), X64 RyuJIT
InvocationCount=1 UnrollFactor=1
Method | StateMode | MsgCount | Mean | Error | StdDev |
---|---|---|---|---|---|
SingleRequestResponseToRemoteEntityWithLocalProxy | Persistence | 10000 | 3.113 s | 0.0616 s | 0.0710 s |
SingleRequestResponseToRemoteEntityWithLocalProxy | DData | 10000 | 3.126 s | 0.0417 s | 0.0390 s |
The results are below: I still get >3s to complete 10000 messages, so it seems not using ASK on the remote path does not improve much on the results from SingleRequestResponseToRemoteEntity
Your performance issue here is flow control primarily:
```csharp
[Benchmark]
public async Task SingleRequestResponseToRemoteEntityWithLocalProxy()
{
    for (var i = 0; i < MsgCount; i++)
        await _localRouter.Ask<ShardedMessage>(new SendShardedMessage(_messageToSys2.EntityId, _messageToSys2));
}
```
It's the `await` that is killing you, as I pointed out here: https://github.com/akkadotnet/akka.net/issues/5203#issuecomment-902828065

I think there is something wrong with your HTTP benchmark, as you should be seeing throughput numbers similar to when there are lots of concurrent Ask calls - performance more on the order of hundreds of ms, not thousands, for 10k messages. That what you're seeing lines up nearly identically with our serial `Ask<T>` numbers makes me suspicious. I wonder if it's a synchronization context issue?
@Aaronontheweb I understand it's not ideal to `await`, however I'm not sure I understand what the correct approach should be for the use case at hand (https://github.com/akkadotnet/akka.net/issues/5203#issuecomment-903241323)?

In principle I am using the approach described at https://petabridge.com/blog/akkadotnet-aspnet/ for Request/Response. Perhaps there's a better approach?

Furthermore, using the same approach for a local entity gives acceptable throughput even with `await`:
```csharp
public async Task SingleRequestResponseToLocalEntity()
{
    for (var i = 0; i < MsgCount; i++)
        await _shardRegion1.Ask<ShardedMessage>(_messageToSys1);
}
```
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1165 (21H1/May2021Update)
Intel Core i7-10850H CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=5.0.206
[Host] : .NET Core 3.1.18 (CoreCLR 4.700.21.35901, CoreFX 4.700.21.36305), X64 RyuJIT
Job-WQMFMH : .NET Core 3.1.18 (CoreCLR 4.700.21.35901, CoreFX 4.700.21.36305), X64 RyuJIT
InvocationCount=1 UnrollFactor=1
Method | StateMode | MsgCount | Mean | Error | StdDev |
---|---|---|---|---|---|
SingleRequestResponseToLocalEntity | Persistence | 10000 | 63.94 ms | 1.260 ms | 2.139 ms |
SingleRequestResponseToLocalEntity | DData | 10000 | 65.08 ms | 1.233 ms | 2.254 ms |
With `SingleRequestResponseToRemoteEntityWithLocalProxy` I have tried to come up with an approach that does not rely on an `Ask` for the remote operation, but it seems that it doesn't solve the problem?
Thank you
I understand it's not ideal to await however I'm not sure I understand what should be the correct approach for the use case at hand
Your original benchmark code was just fine for implementing request-response - but when many concurrent HTTP clients are all hitting that endpoint, all of their `Ask<T>`s run in parallel. That's not the case with our benchmarks where we're getting 3 seconds per 10,000 messages - all of those `Ask<T>`s wait in sequence. That's the problem: concurrent vs. sequential. It has nothing to do with `Ask<T>` itself - it's whether many requests can be in-flight at once or whether they get serialized into a 1-by-1 sequence. Does that make sense - that what you're seeing is the combination of real I/O (the difference between 5ms and 500ms in our local vs. remote `Tell` benchmarks) and flow control (the difference between 500ms and ~3s in Tell vs. Ask)? That the issue is that, the way those Ask benchmarks are structured, one request can't begin until another completes?

What's weird is that your benchmark was designed to be concurrent, but its performance numbers match the sequential access benchmarks - that's what I don't understand.
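As a toy model of that flow-control difference (plain .NET, no Akka involved; the latency is a stand-in, not a measurement): with a fixed round-trip cost, sequential awaits pay it once per request, while concurrent requests overlap their waits.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class FlowControlToy
{
    // Task.Delay stands in for one remote Ask round-trip.
    public static async Task<TimeSpan> Run(bool sequential, int count, TimeSpan roundTrip)
    {
        var sw = Stopwatch.StartNew();
        if (sequential)
        {
            for (var i = 0; i < count; i++)
                await Task.Delay(roundTrip); // next request can't start until this one completes
        }
        else
        {
            var tasks = new Task[count];
            for (var i = 0; i < count; i++)
                tasks[i] = Task.Delay(roundTrip); // all requests in flight at once
            await Task.WhenAll(tasks);
        }
        return sw.Elapsed;
    }
}
```

Sequential runs in roughly count × roundTrip; concurrent runs in roughly one roundTrip plus scheduling overhead - the same shape as the serial vs. `Task.WhenAll` Ask benchmarks above.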
Furthermore, using the same approach for a local entity gives acceptable throughput even with await :
That's all happening in-memory and there's no real I/O involved, so the numbers will be good. We know that Akka.Remote definitely has some performance constraints since it's all running over a single TCP connection per node and we've plans to fix that, but it's a bigger project.
Since writing my last comment we've improved the performance of Akka.Remote by ~40%: https://github.com/akkadotnet/akka.net/pull/5247
However, the improvement in the Akka.Cluster.Sharding benchmark is only about ~15% - the thing that really moved the needle was our improvements around `IActorRef` resolution in the deserialization pipeline, which has a high impact on the temporary actors that power `Ask<T>` and thus can't be cached on either read-side (receiver in, sender in). That also helped speed up child lookups inside the `ShardRegion` too, but the effects of that are pretty small outside of a micro-benchmark.
As to why this benchmark isn't benefitting linearly from the Akka.Remote improvements, it likely has to do with not being able to take advantage of I/O batching or caching on either side due to how the temporary actors work and how each request --> response pair has to complete synchronously in this benchmark as it's written. I'd like to run @carl-camilleri-uom 's original HTTP benchmark first and get some more data from that, since that benchmark should benefit from I/O batching but that's not apparent from his numbers.
I think there's more that could be done to speed up some individual operations happening on the read-side in particular, since we're pretty certain that's where the bottleneck is in our system, but that's going to take some more trial-and-error to find.
I'm also going to look into speeding up the operations of the sharding actors themselves - we know they're wasteful: https://github.com/akkadotnet/akka.net/pull/5217
Hi,
Just in case it is of interest, I have in the meantime performed similar benchmarks in the JVM world.
I have reported my findings at https://github.com/akka/akka/issues/30546.
Throughput is faster on Akka (JVM) but I still replicate a massive drop in throughput when comparing calls to local actors vs. calls to a remote actor in the cluster. In one particular test, the drop went from >120k req/s to <15k req/s.
One very empirical observation that I get from benchmarking both Akka.Net and Akka (JVM) is that I am never able to saturate the CPU of the cluster machines when calling a remote entity, whereas the CPU is easily saturated when calling a local entity in the cluster.
This behaviour does not seem to be related to network I/O - for interest I have tried setting up a GRPC endpoint and tested it under the same conditions i.e. same benchmark, but instead of making a call to the entity shard, making a call to the remote GRPC endpoint. In this case, CPU on both machines is saturated and throughput remains approximately the same as if a local actor is called.
At this stage I start wondering whether this is due to a single actor being responsible to marshal calls on a per-shard basis. And therefore any request to the same shard has to go through a single actor.
This may be a stupid idea as I do not have full knowledge of the internals of the Akka clustering working, so apologies in advance!
Maybe I'm also hitting a bit of an Achilles heel, but I think my use case is a very common one: an Akka(.NET) cluster serving as a distributed cache of objects with non-uniform traffic, i.e. some of those objects (or indeed, shards) are responsible for the majority of the traffic.
Thanks
At this stage I start wondering whether this is due to a single actor being responsible to marshal calls on a per-shard basis. And therefore any request to the same shard has to go through a single actor.
This may be a stupid idea as I do not have full knowledge of the internals of the Akka clustering working, so apologies in advance!
It's definitely not stupid! Since you reported your issue we've just about doubled the throughput of Akka.Remote but that hasn't translated into a linear correlation in shard performance, which indicates exactly what you suspect: a bottleneck somewhere inside the sharding system.
I think it's very likely the `ShardRegion` --> `ShardRegion` --> `Shard` messaging path that is acting as the bottleneck in this case. In the local scenario the messages only have to follow a `ShardRegion` --> `Shard` --> `Entity` path, but that latency might well be paved over by the fact that there's no real I/O at play.

Ultimately there still needs to be a system for guaranteeing that there's exactly one instance of each entity actor, which is why this exists - but based on some work I and others have done, I think we can make the existing implementation more efficient. Redesigning sharding to remove the bottleneck is a tall order, but reducing the head-of-line blocking inside the `Shard` actors is achievable.
@Aaronontheweb some notes:
By just forwarding, this will give us some insight as to the overhead of forwarding between actors as opposed to what we are doing in those actors themselves.
but that latency might well be paved over with the fact that there's no real I/O at play.
One big difference between local with `serialize-messages` and remote is that the protobuf serialization/deserialization and remote actorref resolution stages are elided.
Having stared at a lot more profiler trees than I'd like to admit, I can say the following:

- `TouchLastMessageTimestamp` is called in Shard delivery and adds up; if passivate-idle is set to off (and one managed manually with ReceiveTimeout) you might be able to eke out a little more performance. I'm making a very blind assumption that using ReceiveTimeout on an actor is going to be a tradeoff, since that is getting put on the main scheduler, so you'll pay the cost there instead of in the sharding pipeline.
- The delegate removal in the `ForkJoinExecutor` is not hard to do and should have a small payoff.
- `UnfairSemaphore.Release` shows up in those profiles.
- Adding `ReadOnlyMemory<byte>` overloads for deserialization in an overridable fashion (i.e. a `virtual` that just calls the `byte[]` version with `.ToArray()`) will let us build 'opt-in' deserialization where copies can be avoided by remoting as well as others.
- There's an `UnsafeByteOperations.UnsafeWrap` option that lets you use a byte array as a byte string without a copy. You are on your honor to not modify the byte array after doing so (we don't in remote, so it would be a safe op).
- An `Ask` cache -kinda- works. The challenge is that either we have to pay for Cycle on add, or manually Remove on the LRU somewhere. While offloading that work to another actor helps, the cost of putting a message onto that actor's mailbox isn't -that- much better, and then we need to give it a thread on the remote dispatcher to avoid getting in the way of remoting, which is another cost.

If we were able to spec serializer IDs as having a max value, we probably could...
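The passivate-idle point above can be made concrete with a minimal sketch (the two-minute timeout and the entity/message names are illustrative assumptions, not from this thread): disable sharding's built-in idle timer in HOCON and let the entity passivate itself via `ReceiveTimeout` instead.

```hocon
# removes the per-message TouchLastMessageTimestamp bookkeeping from the hot path
akka.cluster.sharding.passivate-idle-entity-after = off
```

```csharp
using System;
using Akka.Actor;
using Akka.Cluster.Sharding;

public sealed class PingPongEntity : ReceiveActor
{
    public PingPongEntity()
    {
        // manage idleness manually; note this puts the timeout on the
        // scheduler, so the cost moves there instead of the sharding pipeline
        Context.SetReceiveTimeout(TimeSpan.FromMinutes(2));
        Receive<ReceiveTimeout>(_ =>
            Context.Parent.Tell(new Passivate(PoisonPill.Instance)));
        Receive<MsgPing>(_ => Sender.Tell(new MsgPong()));
    }
}
```

As noted above, this is a tradeoff rather than a free win: the scheduler now carries the cost that the sharding pipeline used to pay per message.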
The literature on those is a max value of 1024 with 0-99 reserved for internal Akka.NET use.
I really, really, really wish we could get rid of DotNetty. Or at least update it.
That makes two of us. The new version of DotNetty does replace some of its threading internals but the results were gruesome. I want to push on the gas pedal for Akka.NET v1.5 and Artery but my docket is full of other, mostly Petabridge-related issues at the moment.
The literature on those is a max value of 1024 with 0-99 reserved for internal Akka.NET use.
So, what we could do here, is allocate an array of 1280 Serializers up front. We do this because I am 98% sure we have some internal serializers defined with negative IDs, so everything in the array is shifted +256. This makes Serializer lookups by ID a simple-ish array check instead of a dictionary lookup. This will increase memory usage by 5/10KB (32/64 bit) but it will be a fixed cost. Bonus, this will also help scenarios in Persistence and elsewhere that we look up the Serializer by ID.
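That array-based lookup could be sketched along these lines - a sketch under the stated assumptions (field and method names are illustrative, not the actual Akka.NET serialization internals):

```csharp
// Serializer IDs are specced at a max of 1024 with 0-99 reserved, and some
// internal serializers use negative IDs, so every ID is shifted +256 into a
// fixed preallocated array as proposed above.
private const int IdOffset = 256;
private readonly Serializer[] _serializersById = new Serializer[1280];

public void Register(Serializer serializer)
    => _serializersById[serializer.Identifier + IdOffset] = serializer;

public Serializer GetSerializerById(int serializerId)
{
    // O(1) array index instead of a dictionary lookup on the hot path
    var serializer = _serializersById[serializerId + IdOffset];
    if (serializer is null)
        throw new SerializationException($"Unknown serializer id [{serializerId}]");
    return serializer;
}
```

The fixed array costs a few KB up front, but every lookup by ID (remoting, Persistence, elsewhere) becomes a bounds-checked index instead of a hash probe.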
I really, really, really wish we could get rid of DotNetty. Or at least update it.
That makes two of us. The new version of DotNetty does replace some of its threading internals but the results were gruesome. I want to push on the gas pedal for Akka.NET v1.5 and Artery but my docket is full of other, mostly Petabridge-related issues at the moment.
Didn't see this before, so I'll say it here; maybe SpanNetty would be worth a look from a low-effort gain standpoint. The other low-effort items would be ForkJoin delegate removal, UnsafeWrap in Remote, Serializer IDs, and doing some EqualityComparer defs on various collections.
Measuring Sharding performance with some of the serialization changes that will be introduced in v1.4.30 - these are mostly memory / allocation optimizations so the impact on throughput should be minor.
// Summary
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19041.1348 (2004/May2020Update/20H1)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.100
[Host] : .NET Core 3.1.17 (CoreCLR 4.700.21.31506, CoreFX 4.700.21.31502), X64 RyuJIT
Job-TMZJWF : .NET Core 3.1.17 (CoreCLR 4.700.21.31506, CoreFX 4.700.21.31502), X64 RyuJIT
InvocationCount=1 UnrollFactor=1
Method | StateMode | MsgCount | Mean | Error | StdDev |
---|---|---|---|---|---|
-----------: | |||||
SingleRequestResponseToLocalEntity | Persistence | 10000 | 112.852 ms | 2.2494 ms | |
3.5020 ms | |||||
StreamingToLocalEntity | Persistence | 10000 | 4.797 ms | 0.2881 ms | |
0.8267 ms | |||||
SingleRequestResponseToRemoteEntity | Persistence | 10000 | 3,532.257 ms | 11.9092 ms | |
10.5572 ms | |||||
SingleRequestResponseToRemoteEntityWithLocalProxy | Persistence | 10000 | 4,060.152 ms | 7.0778 ms | |
5.9103 ms | |||||
StreamingToRemoteEntity | Persistence | 10000 | 412.605 ms | 5.3912 ms | |
5.0429 ms | |||||
SingleRequestResponseToLocalEntity | DData | 10000 | 108.035 ms | 2.1032 ms | |
2.5037 ms | |||||
StreamingToLocalEntity | DData | 10000 | 4.682 ms | 0.2615 ms | |
0.7420 ms | |||||
SingleRequestResponseToRemoteEntity | DData | 10000 | 3,559.191 ms | 11.0759 ms | |
10.3604 ms | |||||
SingleRequestResponseToRemoteEntityWithLocalProxy | DData | 10000 | 4,057.514 ms | 8.6033 ms | |
7.6266 ms | |||||
StreamingToRemoteEntity | DData | 10000 | 407.138 ms | 3.0236 ms | |
2.8283 ms |
// Summary
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19041.1348 (2004/May2020Update/20H1)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.100
[Host] : .NET Core 3.1.17 (CoreCLR 4.700.21.31506, CoreFX 4.700.21.31502), X64 RyuJIT
Job-SWGOFT : .NET Core 3.1.17 (CoreCLR 4.700.21.31506, CoreFX 4.700.21.31502), X64 RyuJIT
InvocationCount=1 UnrollFactor=1
Method | StateMode | MsgCount | Mean | Error | StdDev |
---|---|---|---|---|---|
SingleRequestResponseToLocalEntity | Persistence | 10000 | 112.381 ms | 2.2262 ms | 2.8155 ms |
StreamingToLocalEntity | Persistence | 10000 | 5.058 ms | 0.2931 ms | 0.8504 ms |
SingleRequestResponseToRemoteEntity | Persistence | 10000 | 3,518.615 ms | 10.0682 ms | 8.9252 ms |
SingleRequestResponseToRemoteEntityWithLocalProxy | Persistence | 10000 | 4,024.819 ms | 7.6202 ms | 6.7551 ms |
StreamingToRemoteEntity | Persistence | 10000 | 405.815 ms | 3.9392 ms | 3.6847 ms |
SingleRequestResponseToLocalEntity | DData | 10000 | 109.762 ms | 2.1667 ms | 3.3088 ms |
StreamingToLocalEntity | DData | 10000 | 4.806 ms | 0.2646 ms | 0.7550 ms |
SingleRequestResponseToRemoteEntity | DData | 10000 | 3,587.875 ms | 15.3351 ms | 13.5941 ms |
SingleRequestResponseToRemoteEntityWithLocalProxy | DData | 10000 | 4,065.668 ms | 8.6141 ms | 7.6362 ms |
StreamingToRemoteEntity | DData | 10000 | 402.225 ms | 4.5253 ms | 4.2329 ms |
Version Information
Version of Akka.NET? 1.4.23
Which Akka.NET Modules? Akka.Cluster.Sharding
Describe the performance issue
A minimum viable repo that reproduces the issue is at https://github.com/carlcamilleri/benchmark-akka-cluster
Running two n2-standard-8 nodes (8 CPU @ 2.80 GHz and 32GB RAM) with Windows Server 2019 in GCP ("instance-1" and "instance-2"), and a third machine to run the benchmarks from
First check:
curl http://instance-1:5000/5
Response: akka://ping-pong-cluster-system/system/sharding/PingPongActor/13/5(pid=2372,hostname=instance-2)

Therefore the entity id 5 actor is hosted on the instance-2 server
wrk -t48 -c400 -d30s http://instance-2:5000/5
wrk -t48 -c400 -d30s http://instance-1:5000/5
For interest I've also repeated test (1) (i.e. the workload on the endpoint which requests the local actor) but with serialize-messages = on, and the result is:

So Hyperion serialisation drops throughput from >111k to >85k, which is probably expected
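For reference, the switch used in that test is Akka.NET's standard serialize-messages setting, which forces even local messages through the full serialization pipeline:

```hocon
akka {
  actor {
    # serialize all messages, including local ones - useful for
    # approximating remote serialization costs in local benchmarks
    serialize-messages = on
  }
}
```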
Data and Specs
ASKing a local actor I get >111k req/s throughput, but ASKing a remote actor drops throughput to 2.4k req/s.
Expected behavior Cross-Machine communication in Cluster Sharding expected to be faster.
Actual behavior Cross-Machine communication in Cluster Sharding seems to be extremely slow and unusable for my use case (an OLTP workload)
Environment .NET 5.0 Windows Server 2019 n2-standard-8 machine in GCP (8 CPU @ 2.80 GHz and 32GB RAM)