Closed: Zetanova closed this issue 3 years ago
The ConcurrentQueue in the class Helios.Concurrency.DedicatedThreadPool.ThreadPoolWorkQueue could be replaced with the new System.Threading.Channels API from dotnet/runtime.
This would get rid of the UnfairSemaphore implementation, for good or bad.
I don't have the setup/knowledge to measure the perf effects of this change - can somebody test it?
I made a branch https://github.com/Zetanova/akka.net/tree/helios-idle-cpu with a commit that changes the ThreadPoolWorkQueue to System.Threading.Channels.Channel.
Please, can somebody run a test and benchmark? Or explain to me how to get Akka.MultiNodeTestRunner.exe started.
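To make the proposed swap concrete, here is a minimal sketch of what a Channel-based work queue could look like, assuming an unbounded channel of `Action` work items; the class and member names are illustrative, not the code in the linked branch.

```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Illustrative sketch only - not the code from the linked branch.
// A Channel<T> replaces ConcurrentQueue<T> + UnfairSemaphore: producers just
// TryWrite, and idle workers await ReadAsync/WaitToReadAsync instead of
// blocking on a hand-rolled semaphore.
public sealed class ChannelWorkQueue
{
    private readonly Channel<Action> _channel =
        Channel.CreateUnbounded<Action>(new UnboundedChannelOptions
        {
            SingleReader = false, // many worker threads read
            SingleWriter = false  // many producers enqueue
        });

    public bool Add(Action work) => _channel.Writer.TryWrite(work);

    // Each dedicated worker thread runs a loop like this.
    public async Task RunWorkerAsync(CancellationToken token)
    {
        while (await _channel.Reader.WaitToReadAsync(token).ConfigureAwait(false))
        {
            while (_channel.Reader.TryRead(out var work))
            {
                work();
            }
        }
    }
}
```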
Cc @to11mtm - guess I need to move up the timetable on doing that review
@Zetanova I’ll give your branch a try - OOF for a couple of days but I’ll get on it
@Zetanova I'll try to run this through the paces as well in the next few days. :)
I made a helios-io/DedicatedThreadPool fork https://github.com/Zetanova/DedicatedThreadPool/tree/try-channels
The problem is that the benchmark does not count the spin waits / idle CPU
--------------- RESULTS: Helios.Concurrency.Tests.Performance.DedicatedThreadPoolBenchmark+ThreadpoolBenchmark ---------------
--------------- DATA ---------------
TotalBytesAllocated: Max: 3 227 648,00 bytes, Average: 3 220 716,31 bytes, Min: 3 219 456,00 bytes, StdDev: 3 076,37 bytes
TotalBytesAllocated: Max / s: 231 256 177,45 bytes, Average / s: 192 894 727,69 bytes, Min / s: 148 458 491,46 bytes, StdDev / s: 28 365 516,34 bytes
TotalCollections [Gen0]: Max: 0,00 collections, Average: 0,00 collections, Min: 0,00 collections, StdDev: 0,00 collections
TotalCollections [Gen0]: Max / s: 0,00 collections, Average / s: 0,00 collections, Min / s: 0,00 collections, StdDev / s: 0,00 collections
TotalCollections [Gen1]: Max: 0,00 collections, Average: 0,00 collections, Min: 0,00 collections, StdDev: 0,00 collections
TotalCollections [Gen1]: Max / s: 0,00 collections, Average / s: 0,00 collections, Min / s: 0,00 collections, StdDev / s: 0,00 collections
TotalCollections [Gen2]: Max: 0,00 collections, Average: 0,00 collections, Min: 0,00 collections, StdDev: 0,00 collections
TotalCollections [Gen2]: Max / s: 0,00 collections, Average / s: 0,00 collections, Min / s: 0,00 collections, StdDev / s: 0,00 collections
[Counter] BenchmarkCalls: Max: 100 000,00 operations, Average: 100 000,00 operations, Min: 100 000,00 operations, StdDev: 0,00 operations
[Counter] BenchmarkCalls: Max / s: 7 183 082,40 operations, Average / s: 5 989 015,64 operations, Min / s: 4 611 291,21 operations, StdDev / s: 879 635,08 operations
------------ FINISHED Helios.Concurrency.Tests.Performance.DedicatedThreadPoolBenchmark+ThreadpoolBenchmark ----------

--------------- RESULTS: Helios.Concurrency.Tests.Performance.DedicatedThreadPoolBenchmark+ThreadpoolBenchmark ---------------
--------------- DATA ---------------
TotalBytesAllocated: Max: 3 219 456,00 bytes, Average: 3 219 456,00 bytes, Min: 3 219 456,00 bytes, StdDev: 0,00 bytes
TotalBytesAllocated: Max / s: 233 222 932,15 bytes, Average / s: 205 354 466,34 bytes, Min / s: 157 580 871,74 bytes, StdDev / s: 28 045 400,62 bytes
TotalCollections [Gen0]: Max: 0,00 collections, Average: 0,00 collections, Min: 0,00 collections, StdDev: 0,00 collections
TotalCollections [Gen0]: Max / s: 0,00 collections, Average / s: 0,00 collections, Min / s: 0,00 collections, StdDev / s: 0,00 collections
TotalCollections [Gen1]: Max: 0,00 collections, Average: 0,00 collections, Min: 0,00 collections, StdDev: 0,00 collections
TotalCollections [Gen1]: Max / s: 0,00 collections, Average / s: 0,00 collections, Min / s: 0,00 collections, StdDev / s: 0,00 collections
TotalCollections [Gen2]: Max: 0,00 collections, Average: 0,00 collections, Min: 0,00 collections, StdDev: 0,00 collections
TotalCollections [Gen2]: Max / s: 0,00 collections, Average / s: 0,00 collections, Min / s: 0,00 collections, StdDev / s: 0,00 collections
[Counter] BenchmarkCalls: Max: 100 000,00 operations, Average: 100 000,00 operations, Min: 100 000,00 operations, StdDev: 0,00 operations
[Counter] BenchmarkCalls: Max / s: 7 244 172,06 operations, Average / s: 6 378 545,52 operations, Min / s: 4 894 642,81 operations, StdDev / s: 871 122,35 operations
------------ FINISHED Helios.Concurrency.Tests.Performance.DedicatedThreadPoolBenchmark+ThreadpoolBenchmark ----------
Even if the default implementation from dotnet/runtime does not fit, we could implement a custom Channel and reuse it elsewhere, even if it's only the ChannelReader subpart.
Increasing and decreasing the number of thread-workers is easily possible. Because all waiting thread-workers are woken on new work, a thread-worker can count misses in its loop and either stop by itself or be marked to stop by the DTP.
Even a zero-alive thread-worker scenario could be possible.
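A minimal sketch of that scale-down idea, assuming each worker tracks consecutive empty reads ("misses") and retires itself after a threshold; the names (`MissLimit`, `_liveWorkers`, `ScalingWorker`) are illustrative, not DTP code.

```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Illustrative sketch of the scale-down idea, not actual DedicatedThreadPool code.
// Each worker counts consecutive empty reads and retires once it has been idle
// too often; the pool can spawn replacements later when the queue grows again.
public sealed class ScalingWorker
{
    private const int MissLimit = 100;  // consecutive misses before retiring
    private static int _liveWorkers;    // shared count of running workers

    public static async Task RunAsync(ChannelReader<Action> reader, CancellationToken token)
    {
        Interlocked.Increment(ref _liveWorkers);
        try
        {
            var misses = 0;
            while (!token.IsCancellationRequested)
            {
                if (reader.TryRead(out var work)) { misses = 0; work(); continue; }

                if (++misses >= MissLimit && Volatile.Read(ref _liveWorkers) > 1)
                {
                    // Idle too often: retire this worker so an idle pool shrinks
                    // instead of keeping every thread waking up for nothing.
                    return;
                }

                // Park until new work is written to the channel (no busy spinning).
                if (!await reader.WaitToReadAsync(token).ConfigureAwait(false))
                    return; // channel completed
            }
        }
        finally
        {
            Interlocked.Decrement(ref _liveWorkers);
        }
    }
}
```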
From the Pro: https://devblogs.microsoft.com/dotnet/an-introduction-to-system-threading-channels/
Detail blog post: https://www.stevejgordon.co.uk/dotnet-internals-system-threading-channels-unboundedchannel-part-1
@Zetanova I ran some tests against the branch in #4594 to see whether this helped/hurt.
Background: On my local machine, under RemotePingPong, Streams TCP Transport gets up to 300k messages/sec if everything runs on the normal .NET Threadpool.
- DedicatedThreadPool: 180,000-220,000 msg/sec (this variance increased when the internal-dispatcher changes were merged in, @Aaronontheweb, but I'm going to blame my dirty laptop to some extent)
- System.Threading.Channels-based DedicatedThreadPool: 130,000-170,000 msg/sec
- System.Threading.Channels-based DedicatedThreadPool with AllowSynchronousContinuations = true: 200,000-220,000 msg/sec

I think this could be on the right track. I know that UnfairSemaphore has been somewhat supplanted/deprecated in Core at this point too.
@to11mtm thanks for the test - I didn't know what AllowSynchronousContinuations does, so I didn't set it.
Kestrel is using System.Threading.Channels for its connection management.
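For reference, a sketch of how that option would be set when creating the work-queue channel (assuming an unbounded channel like the one in the fork). AllowSynchronousContinuations lets a writer run a parked reader's continuation inline on the producing thread, trading isolation for fewer thread hand-offs.

```csharp
using System;
using System.Threading.Channels;

// Sketch of the channel options - not the exact fork code.
var channel = Channel.CreateUnbounded<Action>(new UnboundedChannelOptions
{
    SingleReader = false,
    SingleWriter = false,
    // Let the thread that writes a work item run a waiting reader's continuation
    // synchronously, avoiding an extra thread-pool hop per item. This is what
    // produced the 200-220k msg/sec numbers above, at the cost of the producer
    // potentially executing consumer code inline.
    AllowSynchronousContinuations = true
});
```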
What's important is to test the idle state of a cluster under Windows and/or Linux. The current Akka 1.4.12 idles at 100-120m (millicores) per akka-node in a k8s cluster, and at 25% per node on my dev machine with a 3-CPU limit for Docker/WSL2. It does not matter whether the node has custom actors running or is only connected 'empty' to the cluster.
On my on-premise k8s cluster it does not matter that much, but on AWS or Azure it does a lot. Nearly all EC2 instances without "unlimited" support no more than 20% constant CPU load per core.
I will try now to implement an autoscaler for the DTP.
@Zetanova I think it's definitely on the right path. If you can auto-scale that might help too, what I noticed in the profiler is we still have all of these threads waiting for channel reads very frequently, I'm not sure if there's a cleaner way to keep them fed...
Sorry, one more note...
I wonder whether we should peek at Orleans Schedulers for some inspiration?
At [one point](https://github.com/dotnet/orleans/pull/3792/files) they were actually using a variation of our Threadpool, complete with credited borrowing of UnfairSemaphore. It doesn't look like they use that anymore, so perhaps we can look at how they evolved and take some lessons.
This looks like a relatively simple change @Zetanova @to11mtm - very interested to see what we can do. Replacing the DedicatedThreadPool with System.Threading.Channels makes a lot of sense to me.
I wonder whether we should peek at Orleans Schedulers for some inspiration? At [one point](https://github.com/dotnet/orleans/pull/3792/files) they were actually using a variation of our Threadpool, complete with credited borrowing of UnfairSemaphore. It doesn't look like they use that anymore, so perhaps we can look at how they evolved and take some lessons.
I'm onboard with implementing good ideas no matter where they come from. The DedicatedThreadPool abstraction was something we created back in... must have been 2013 / 2014. It's ancient. .NET has evolved a lot since in terms of the types of scheduling primitives it allows.
I think a major part of the issue with the DedicatedThreadPool, as this was something we looked at prior to the internal-dispatcher change, is that it pre-allocates all threads up front - therefore you're going to have a lot of idle workers lying around checking for work in systems that aren't busy. The design should be changed to auto-scale threads up and down.
I suggested a few ways of doing this - one was to put a tracer round in the queue and measure how long it took to make it to the front. The other was to measure the growth in the task queue and allocate threads based on growth trends. Both of these have costs in terms of complexity and raw throughput, but the advantage is that in less busy or sporadically busy systems they're more efficient at conserving CPU utilization.
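A rough sketch of the "tracer round" idea, assuming the queue is a channel of `Action`s; the 20 ms threshold and the `scaleUp`/`allowScaleDown` callbacks are invented for illustration and not part of any existing API.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Channels;

// Illustrative sketch: periodically enqueue a timestamped no-op ("tracer round")
// and use its measured queue delay to decide whether to grow or shrink the pool.
static void EnqueueTracer(ChannelWriter<Action> writer, Action scaleUp, Action allowScaleDown)
{
    var scaleUpDelay = TimeSpan.FromMilliseconds(20); // made-up threshold
    var enqueuedAt = Stopwatch.GetTimestamp();

    writer.TryWrite(() =>
    {
        // How long did this no-op sit in the queue before a worker ran it?
        var waited = TimeSpan.FromSeconds(
            (Stopwatch.GetTimestamp() - enqueuedAt) / (double)Stopwatch.Frequency);

        if (waited > scaleUpDelay)
            scaleUp();          // queue is backing up: add a worker thread
        else if (waited < TimeSpan.FromTicks(scaleUpDelay.Ticks / 4))
            allowScaleDown();   // queue drains quickly: let a worker retire
    });
}
```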
Looks like the CLR solves this problem via a Hill-climbing algorithm to continually try to optimize the thread count https://github.com/dotnet/runtime/blob/4dc2ee1b5c0598ca02a69f63d03201129a3bf3f1/src/libraries/System.Private.CoreLib/src/System/Threading/PortableThreadPool.HillClimbing.cs
Based on the data from this PR that @to11mtm referenced: https://github.com/dotnet/orleans/pull/6261
An idea: the big problem we've tried to solve by having separate threadpools was ultimately caused by the idea of work queue prioritization - that some work, which is time sensitive, needs to have a shorter route to being actively worked on than other work.
The big obstacle we've run into historically with the default .NET ThreadPool was that its work queue can grow quite large, especially with a large number of Tasks, /user actor messages, and so on - and as a result the /system actors, whose work is much more concentrated and time sensitive, suffered.
What if we solved this problem by having two different work queues routing to the same thread pool rather than two different work queues routing to separate thread pools? We could move the /system and /user actors onto separate dispatchers, each with its own work queue (which we'd have to implement by creating something that sits above the ThreadPool, i.e. a set of separate System.Threading.Channels.Channel&lt;T&gt; instances), but both of them would still use the same underlying threads to conduct the work.
The problems that could solve:
The downsides are that outside of the Akka.NET dispatchers, anyone can queue work onto the underlying threadpool - so we might see a return of the types of problems we had around Akka.NET 1.0 where time sensitive infrastructure tasks like Akka.Remote / Akka.Persistence time out due to the length of the work queue.
I'd be open to experimenting with that approach too and ditching the idea of separate thread pools entirely.
The downsides are that outside of the Akka.NET dispatchers, anyone can queue work onto the underlying threadpool - so we might see a return of the types of problems we had around Akka.NET 1.0 where time sensitive infrastructure tasks like Akka.Remote / Akka.Persistence time out due to the length of the work queue.
Perhaps then it makes sense to keep the existing one around if this route is taken? That way, if you are unfortunately having to deal with noisy code for whatever reason in your system, you can at least 'pick your poison'.
This does fall into the category of 'things that are easier to solve in .NET Core 3.1+'; 3.1+ lets you look at the work queue counts, and at that point we could 'spin up' additional threads if the work queue looks too long.
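A small sketch of that observation: on .NET Core 3.0+/3.1+ the global queue depth is exposed via `ThreadPool.PendingWorkItemCount`, so a monitor could react when the pool looks backed up. The threshold and reaction below are illustrative, not Akka.NET code.

```csharp
using System;
using System.Threading;

// Sketch: watch the global thread pool queue and nudge the pool when it backs up.
static void CheckGlobalQueue()
{
    long pending = ThreadPool.PendingWorkItemCount; // items queued but not yet started

    if (pending > 10_000) // made-up threshold
    {
        // e.g. raise the floor so the runtime injects worker threads sooner
        ThreadPool.GetMinThreads(out var workers, out var io);
        ThreadPool.SetMinThreads(workers + Environment.ProcessorCount, io);
    }
}
```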
Yes, the DedicatedThreadPool is not ideal. I am currently working on it only to simplify it and maybe remove the idle-CPU issue.
2-3 channels to queue work by priority inside a single dispatcher would be the way to go:
channel-3: instant / work stealing
channel-2: high / short work
channel-1: normal / long work
The queue algo could be very simple, like (see the sketch below):
1) Queue, or maybe execute, all work from channel-3
2) Queue a few work items from channel-2
3) If channel-3 or channel-2 had work, then queue only one work item from channel-1; else queue a few work items from channel-1
4) If there was no work, then wait on channel-1, channel-2 or channel-3
5) Repeat with 1)
Maybe channel-3 is not needed and a flag to directly execute the work item can be used.
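A rough sketch of that loop, assuming three unbounded channels read by a single dispatch loop; the channel roles and batch sizes are illustrative, not actual Akka.NET/DTP code, and work items are run inline where a real dispatcher would hand them to worker threads.

```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Rough sketch of the priority loop described above.
static async Task DispatchLoopAsync(
    ChannelReader<Action> normal,   // channel-1: normal/long work
    ChannelReader<Action> high,     // channel-2: high/short work
    ChannelReader<Action> instant,  // channel-3: instant / work-stealing
    CancellationToken token)
{
    while (!token.IsCancellationRequested)
    {
        var hadPriorityWork = false;
        var hadNormalWork = false;

        // 1) Drain everything from channel-3 first.
        while (instant.TryRead(out var w3)) { hadPriorityWork = true; w3(); }

        // 2) Take a few items from channel-2.
        for (var i = 0; i < 4 && high.TryRead(out var w2); i++) { hadPriorityWork = true; w2(); }

        // 3) While priority work exists, let only one channel-1 item through;
        //    otherwise take a few.
        var normalBudget = hadPriorityWork ? 1 : 4;
        for (var i = 0; i < normalBudget && normal.TryRead(out var w1); i++) { hadNormalWork = true; w1(); }

        // 4) No work anywhere: park until any channel has something.
        //    (Cancellation handling is simplified for this sketch.)
        if (!hadPriorityWork && !hadNormalWork)
        {
            await Task.WhenAny(
                instant.WaitToReadAsync().AsTask(),
                high.WaitToReadAsync().AsTask(),
                normal.WaitToReadAsync().AsTask()).ConfigureAwait(false);
        }
        // 5) Repeat.
    }
}
```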
If an external source queues too much work on the ThreadPool, other components like sockets will have issues, not only Akka.Remote. Akka.NET should not try to "resolve" this external issue.
I will try to look into the Dispatcher next, after the DedicatedThreadPool.
@to11mtm please benchmark my commit again: https://github.com/Zetanova/akka.net/tree/helios-idle-cpu - I added a simple auto-scaler.
If possible, please form a 5-7 node cluster and look at the idle CPU state.
Maybe somebody has time to explain to me how to start the benchmarks and MultiNode tests - somehow I don't get it.
@Zetanova - Looks like this last set of changes impacted throughput negatively; it looks like either we are spinning up new threads too slowly, or there's some other overhead negatively impacting us as we try to ramp up.
What I'm measuring is the messages/sec of RemotePingPong on [this branch](https://github.com/to11mtm/akka.net/tree/remote-full-manual-protobuf-deser); if you can build it you should be able to run it easily enough.
Edit: It's kinda all over the place with this set of changes, anywhere from 100,000 to 180,000 msg/sec.
If possible pls form a 5-7 node cluster and look at the idle CPU state.
Unfortunately I don't have a cluster setup handy that I can use for testing this, and I won't have time to set one up for quite some time either :(
@to11mtm thanks for the run. On how many cores are you testing? Maybe it is just that I set the max thread count to Environment.ProcessorCount - 1: https://github.com/Zetanova/akka.net/blob/61a0d921d74ac10b8aaba6bc09cc0f25bff87ed3/src/core/Akka/Helios.Concurrency.DedicatedThreadPool.cs#L53
Currently the scheduler checks every 50 work items whether to reschedule.
There is now a _waitingWork counter that we could use to force a thread increase.
But the main problem is not to support max throughput; it's to test whether it scales down and/or the idle CPU issue gets resolved.
@to11mtm I checked again, found a small error, and made a new commit.
The mistake is _cleanCounter = 0; it should be _cleanCounter = 1:
https://github.com/Zetanova/akka.net/blob/7dd6279dac948dea23bd87d252717fc28ea9728a/src/core/Akka/Helios.Concurrency.DedicatedThreadPool.cs#L328-L333
Otherwise it should be more or less the same as the first commit without the auto-scaler.
It sets up MaxThread from the start and scales down only if there is a very low work count. Under load like PingPong and RemotePingPong no down-scaling happens.
I could not run RemotePingPong because of some null exception on startup, but PingPong did run and looked OK.
My CPU ran at 'only' 60% - that's because of Intel Hyper-Threading.
@Aaronontheweb could you take a look? If it takes longer, then I would need to replace the Intel i7-920 in my dev machine after 11 years.
@Zetanova haven't been able to get RemotePingPong to run on my machine with these changes yet - it just idles without running
Idea I'm going to play with - moving all dispatchers to use shared / separate TaskSchedulers that run on the default .NET threadpool, rather than separate thread pools.
@Aaronontheweb This is the simplest one and most likely the best performing. Try this branch, it uses the normal ThreadPool: https://github.com/Zetanova/akka.net/tree/helios-idle-cpu-pooled
It does work, but Akka is not using DedicatedThreadPoolTaskScheduler, only DedicatedThreadPool for the ForkJoinExecutor.
I'm taking some notes as I go through this - we really have three issues here:

1. Our custom Thread implementations are inefficient at managing scenarios where there isn't enough scheduled work to do - this is true for DotNetty, the scheduler, and the DedicatedThreadPool. Not a problem that anyone other than the .NET ThreadPool has solved well. Automatically scaling the thread pools up and down with demand would solve a lot of those problems. Hence why we have issues such as https://github.com/akkadotnet/akka.net/issues/4031
2. We run multiple thread pools side by side: the default .NET ThreadPool, and when running Akka.Remote we have one dedicated thread pool for remoting and a second one for all /system actors, plus a dedicated thread for the scheduler. All of those custom thread pool implementations are excellent for separating work queues, but not great at managing threads efficiently within a single Akka.NET process.

Solutions, in order of least risk to existing Akka.NET users / implementations:

1. Modify the DedicatedThreadPool to scale up and scale down, per @Zetanova's attempts - that's a good effort and can probably be optimized without that much effort. I'd really need to write an idle CPU measurement and stick that in the DedicatedThreadPool repository, which I don't think would be terribly hard.
1a. Migrate the DotNetty Single Thread Event Executor / EventLoopGroup to piggy-back off of the Akka.Remote dispatcher. Fewer threads to manage and keep track of idle / non-idle times.
1b. Migrate the Akka.Remote.DotNettyTransport batching system to piggy-back off of the HashedWheelTimer instead of DotNetty itself. If all of that can be done successfully, then none of DotNetty's threading primitives should be used.
2. Move all dispatchers onto TaskSchedulers and rewrite all mailbox processing to occur as a single-shot Task. This is something we've discussed as part of the 1.5 milestone anyway, and it would solve a lot of problems for Akka.NET actors (i.e. AsyncLocal now works correctly from inside actors, all Tasks initiated from inside an actor get executed inside the same dispatcher, etc.). The risk is that there are a lot of potentially unknown side effects and it will require introducing new APIs and deprecating old ones. Most of these APIs are internal so it's not a big deal, but some of them are public and we always need to be careful with that. The thread management problems in this instance would be solved by moving all of our work onto the .NET ThreadPool and simply using different TaskScheduler instances to manage the workloads on a per-dispatcher basis.

I'm doing some work on item number 2 to assess how feasible that is - since that can descend into yak-shaving pretty quickly.
Getting approach number 1 to work is more straightforward and @Zetanova has already done some good work there. It's just that I consider approach number 2 to be a better long-term solution to this problem, and if it's only marginally more expensive to implement, then that's what I'd prefer to do.
Some benchmark data from some of @Zetanova's PRs on my machine (AMD Ryzen 1st generation)
As a side note: looks like we significantly increased the number of messages written per round. That is going to crush the nuts of the first round of this benchmark due to the way batching is implemented - we can never hit the threshold so long as the number of messages per round / per actor remains low on that first round. But that's a good argument for leaving batching off by default, I suppose.
dev:
ProcessorCount: 16
ClockSpeed: 0 MHZ
Actor Count: 32
Messages sent/received per client: 200000 (2e5)
Is Server GC: True
Num clients, Total [msg], Msgs/sec, Total [ms]
1, 200000, 1434, 139533.84
5, 1000000, 191022, 5235.61
10, 2000000, 181703, 11007.80
15, 3000000, 179781, 16687.83
20, 4000000, 170904, 23405.72
25, 5000000, 176704, 28296.62
30, 6000000, 175856, 34119.68
Done..
helios-idle-cpu-pooled:
ProcessorCount: 16
ClockSpeed: 0 MHZ
Actor Count: 32
Messages sent/received per client: 200000 (2e5)
Is Server GC: True
Num clients, Total [msg], Msgs/sec, Total [ms]
1, 200000, 1194, 167506.14
5, 1000000, 156765, 6379.06
10, 2000000, 156556, 12775.24
15, 3000000, 158815, 18890.32
20, 4000000, 164908, 24256.93
25, 5000000, 165810, 30155.52
helios-idle-cpu:
ProcessorCount: 16
ClockSpeed: 0 MHZ
Actor Count: 32
Messages sent/received per client: 200000 (2e5)
Is Server GC: True
Num clients, Total [msg], Msgs/sec, Total [ms]
1, 200000, 1215, 164698.09
5, 1000000, 192419, 5197.48
10, 2000000, 190477, 10500.94
15, 3000000, 185679, 16157.99
20, 4000000, 183209, 21833.07
25, 5000000, 126657, 39477.82
30, 6000000, 192314, 31199.53
@Aaronontheweb Thanks for testing.
In the 'helios-idle-cpu-pooled' branch there is only a modified DedicatedThreadPoolTaskScheduler that schedules work on the .NET ThreadPool. I thought that Akka was already using a TaskScheduler in the Dispatcher, but it does not use it. https://github.com/Zetanova/akka.net/blob/0fb700d0754c447652e121337ca41fd44900eb65/src/core/Akka/Helios.Concurrency.DedicatedThreadPool.cs#L114-L267
You can use it for your approach 2). If the dispatchers used this TaskScheduler, then work items would be processed in a loop in parallel up to ProcessorCount, and a pooled thread would be released only after the work item queue is empty. It is the same as before, but without the custom DedicatedThreadPool implementation.
If the .NET ThreadPool is not creating threads fast enough, it could be manipulated with ThreadPool.SetMinThreads.
@Zetanova I think you have the right idea with your design thus far.
After doing some tire-kicking on approach number 2 - that's a big hairy redesign that won't solve problems for people with idle CPU issues right now. I'm going to suggest that we try approach number 1 and get a fix out immediately so we can improve the Akka.NET experience for users running on 1.3 and 1.4 right now. Implementing approach number 2 will likely need to wait until Akka.NET v1.5.
@Aaronontheweb I made a simple new commit now. It replaces the ForkJoinExecutor with the TaskSchedulerExecutor, but uses the new DedicatedThreadPoolTaskScheduler: https://github.com/Zetanova/akka.net/tree/helios-idle-cpu-pooled
PingPong works well; memory and GC got lower.
Even with this change there will most likely be a big decrease in idle CPU.
If possible, please test this one with RemotePingPong too.
Will do - I'll take a look. I'm working on an idle CPU benchmark for DedicatedThreadPool now - if that works well I'll do one for Akka.NET too
Working on some specs to actually measure this here: https://github.com/helios-io/DedicatedThreadPool/pull/23
So in case you're wondering what I'm doing, here's my approach: replicate the kind of idle CPU readings you'd see in docker stats, and be able to do this repeatedly via a unit test. I want this so we have a quantifiable baseline.
The UnfairSemaphore in the DedicatedThreadPool does an excellent job of keeping the number of active threads from creeping up when the CPU count is low, which I've been able to verify by manually changing the CPU levels up and down. Running 1600 idle threads on a 16 core machine = 0% CPU once the queue is empty.
I can't even reproduce the idle CPU issues at the moment - so it makes me wonder if the issues showing up in Akka.NET have another side effect (i.e. intermittent load applied by scheduler-driven messaging) that is creating the issue. I'm going to continue to play with this.
Running an idle Cluster.WebCrawler cluster:
CONTAINER ID   NAME                                            CPU %    MEM USAGE / LIMIT     MEM %   NET I/O         BLOCK I/O   PIDS
a17996cdd6f6   clusterwebcrawler_webcrawler.web_1              10.75%   87.4MiB / 50.17GiB    0.17%   120kB / 122kB   0B / 0B     57
c548cb431955   clusterwebcrawler_webcrawler.crawlservice_1     8.25%    44.29MiB / 50.17GiB   0.09%   125kB / 123kB   0B / 0B     39
06e38eed576d   clusterwebcrawler_webcrawler.trackerservice_1   10.75%   46.03MiB / 50.17GiB   0.09%   130kB / 127kB   0B / 0B     39
214aec75d2b5   clusterwebcrawler_webcrawler.lighthouse2_1      0.53%    33.39MiB / 50.17GiB   0.07%   1.16kB / 0B     0B / 0B     22
4996a84e06ef   clusterwebcrawler_webcrawler.lighthouse_1       5.10%    42.62MiB / 50.17GiB   0.08%   134kB / 133kB   0B / 0B
Lighthouse 2 has no connections - it's not included in the cluster. This tells me that there's something other than the DedicatedThreadPool design itself that is responsible for this. Even on a less powerful Intel machine I can't generate much idle CPU using just the DedicatedThreadPool.
Looks like the CLR solves this problem via a Hill-climbing algorithm to continually try to optimize the thread count https://github.com/dotnet/runtime/blob/4dc2ee1b5c0598ca02a69f63d03201129a3bf3f1/src/libraries/System.Private.CoreLib/src/System/Threading/PortableThreadPool.HillClimbing.cs
Interesting... PortableThreadPool is newer bits. Too bad it's still very tightly coupled and not re-usable.
Lighthouse 2 has no connections - it's not included in the cluster. This tells me that there's something other than the DedicatedThreadPool design itself that is responsible for this. Even on a less powerful Intel machine I can't generate much idle CPU using just the DedicatedThreadPool.
Thought:
All 7 nodes are idling and consume 100% (docker is limited to 3 cores)
Has anything been done to check whether this is a resource constraint issue? HashedWheelTimer and the DotNetty executor will each take one thread of their own, alongside whatever else each DTP winds up doing.
yeah, that was my thinking too @to11mtm - I think it's a combination of factors.
One thing I can do - make an IEventLoop that runs on the Akka.Remote dispatcher so DotNetty doesn't fire up its own threadpool. It might be a bit of a pain in the ass but I can try.
yeah, that was my thinking too @to11mtm - I think it's a combination of factors.
One thing I can do - make an IEventLoop that runs on the Akka.Remote dispatcher so DotNetty doesn't fire up its own threadpool. It might be a bit of a pain in the ass but I can try.
Looks at everything needed to implement IEventLoop and its inheritors. Ouch. That said, there could be some ancillary benefits from being on the same threadpool in that case - data cache locality and the like. I know with my transport work there were some scenarios where putting everything in the same pool (i.e. remote, TCP workers, streams) gave benefits, and not just from a 'fewer threadpools' standpoint either... There were some scenarios where a dispatcher with affinity (science experiment here) gave major boosts to performance in low message-traffic scenarios.
@Aaronontheweb The issue appears only in a formed cluster, with or without load. There can be no user actors on the node.
What makes most of the "idle-cpu" usage is the spin-lock: most of the mutexes/timers do it before the thread gets freed/paused.
If there is absolutely no work there are no spin-waits, but if one work item comes from time to time (500ms, 1000ms) the spins will happen.
The Akka scheduler is ticking at 100ms; I think cluster/DotNetty is implemented with a ticker too.
@Aaronontheweb please try a cluster with 3-5 nodes: https://github.com/Zetanova/akka.net/tree/helios-idle-cpu-pooled - I disabled the DTP completely there. Or please tell me how I can run the MultiNode unit tests; somehow I don't get it.
https://github.com/Aaronontheweb/akka.net/tree/feature/IEventLoopGroup-dispatcher - tried moving the entire DotNetty IEventLoopGroup on top of the Akka.Remote dispatcher. Didn't work - the DotNetty pipeline is tightly coupled to its concurrency constructs. Wanted to cite some proof of work here though.
We're working on multiple parallel attempts to address this.
I am pretty sure that the idle load comes from a spin-wait on an event handle, and components like DotNetty tick every &lt;40ms. What happens is:

Case A:
1) A work item arrives (NoOp tick or real work item)
2) The wait handle gets signaled
3) The thread wakes up
4) It processes work items
5) The thread has no work and waits for a new signal or a timeout
6) Because it waits on a signal, it will spin-wait for a short time until the thread gets "fully" paused

Case B:
1) A timeout happens
2) The wait handle gets signaled by the timeout
3) The thread wakes up
4) There are no work items to process
5) The thread has no work and waits for a new signal or a timeout
6) Because it waits on a signal, it will spin-wait for a short time until the thread gets "fully" paused

If the timeout is very low (&lt;30ms) or the signal of a NoOp tick comes very frequently (&lt;30ms), the spin-waits of the WaitHandle add up.
If the timeout is low, the fix would be just to remove the wait on the signal in "Case B / Point 5" only, to remove the spin-wait (see the sketch below):
...
5) The thread has no work and waits ONLY on a short timeout (aka Thread.Sleep)
...
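A minimal sketch of that idea, assuming the worker waits on a ManualResetEventSlim; the names and the 50 ms timeout are illustrative, not the actual DTP/DotNetty code. The point is that after a wakeup which produced no work (Case B), the thread parks on a plain timed sleep instead of re-arming the spin-then-block signal wait.

```csharp
using System;
using System.Threading;

// Illustrative worker wait logic, not actual DTP/DotNetty code.
// _signal is assumed to be Set() by producers whenever new work is enqueued.
public sealed class WorkerParking
{
    private readonly ManualResetEventSlim _signal = new ManualResetEventSlim(false);
    private static readonly TimeSpan IdleTimeout = TimeSpan.FromMilliseconds(50);

    public void WaitForWork(bool lastWakeupHadWork)
    {
        if (lastWakeupHadWork)
        {
            // Case A: we just processed real work - wait for the next signal or
            // the timeout. ManualResetEventSlim.Wait spins briefly before it
            // blocks, which is acceptable while work is actually flowing.
            _signal.Wait(IdleTimeout);
            _signal.Reset();
        }
        else
        {
            // Case B: the last wakeup found nothing to do. Don't re-arm the
            // signal wait (that re-runs the spin phase on every idle tick);
            // just sleep for the short timeout and check the queue afterwards.
            Thread.Sleep(IdleTimeout);
        }
    }
}
```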
I'm in agreement on the causes here - just working on how to safely reduce the amount of "expensive non-work" occurring without creating additional problems.
Achieved a 50% reduction in idle CPU here: https://github.com/akkadotnet/akka.net/pull/4678#issuecomment-747754936
I still have the issue with idle nodes, more or less like in https://github.com/akkadotnet/akka.net/issues/4434: Docker, Akka.NET 1.4.11, dotnet 3.1.404, debug and release builds.
All 7 nodes are idling and consume 100% (Docker is limited to 3 cores). The main hot path is still in DotNetty.
Messages/traffic is low; the node is idling.