Closed voltcode closed 3 years ago
Akka 1.4.1 introduced a new feature in Akka.Remote called "batch writer". Not sure if it might be related, but you can read more here.
Try setting akka.remote.dot-netty.tcp.batching.enabled to false. This will revert the way Akka.Remote behaves to how it did before (1.3.x).
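For reference, that setting goes into the HOCON configuration of the actor system. A minimal fragment, assuming the rest of your akka.remote section stays as it is:

```hocon
akka.remote.dot-netty.tcp {
  batching {
    # Turn the batch writer off so each write is flushed as it was in 1.3.x
    enabled = false
  }
}
```

The same key can be written on one line as akka.remote.dot-netty.tcp.batching.enabled = false.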
Thanks @ismaelhamed. Your suggestion seemed to work. Another thing we tried was sending the pings every 30 seconds instead of every 10 - it also helped.
It seems to me that there's some bug in here though, triggered by communicating often within short periods of time. It's hard to tell when batching is a good idea and when it is not. Moreover, I'd like to know if Akka.Cluster internal gossip could also be affected by batching with similar results.
I'll leave it to the repo owners to decide whether this ticket should be closed or not.
cc @Aaronontheweb
@voltcode the batching code doesn't work well with systems that aren't running under continuously high loads. The issue in this case might be that the size of the batch may have exceeded the size of the outbound buffer. I'll need to look more into the error message to see what the root causes of that are.
A reasonable fix, and probably what we should have done as part of the 1.4.0 release, is to make the batching system opt-in rather than opt-out. We found some other issues with it on "less busy" systems, such as a significant increase in idle CPU consumption from using the DotNetty scheduler behind the scenes.
@Aaronontheweb thanks. I'm more worried about the fact that DotNetty could be at fault and it hasn't been updated in ages. That project seems dead, could be a problem in the future for Akka.NET
@Aaronontheweb regarding the feasibility of DotNetty in the long run, I guess we're betting on #4007 ?
DotNetty vs. Artery:
@Aaronontheweb Maybe we can introduce an IDoNotBuffer and/or IWithHighPriority interface marker. It can then be used to signal the BatchBuffer to flush immediately.
The name IWithHighPriority would create some conflicts with message resorting. Maybe the BatchBuffer could be extended with a priority feature, but being able to trigger a buffer-flush with the Last-NoOp message in a message-sequence would be nice to have.
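To make the idea concrete, here is a rough sketch of how a marker interface could force a flush. Everything in it is hypothetical: ToyBatchBuffer, Heartbeat and BulkUpdate are made-up names, and the buffer works on message objects, which is not how Akka.Remote's batching code is actually structured.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical marker: messages implementing it should not wait in the batch.
public interface IDoNotBuffer { }

// Hypothetical message types used only for illustration.
public sealed class Heartbeat : IDoNotBuffer { }
public sealed class BulkUpdate { }

// Toy stand-in for a batching layer; NOT Akka.Remote's real batch writer.
public sealed class ToyBatchBuffer
{
    private readonly List<object> _pending = new List<object>();
    private readonly int _maxPending;
    private readonly Action<IReadOnlyList<object>> _flush;

    public ToyBatchBuffer(int maxPending, Action<IReadOnlyList<object>> flush)
    {
        _maxPending = maxPending;
        _flush = flush ?? throw new ArgumentNullException(nameof(flush));
    }

    public void Write(object message)
    {
        _pending.Add(message);
        // Flush right away for marked messages, or when the batch is full.
        if (message is IDoNotBuffer || _pending.Count >= _maxPending)
            Flush();
    }

    public void Flush()
    {
        if (_pending.Count == 0) return;
        var batch = _pending.ToArray();
        _pending.Clear();
        _flush(batch);
    }
}
```

With maxPending set to 3, writing a BulkUpdate leaves it queued, while writing a Heartbeat afterwards flushes both messages in one batch, so ordering is preserved.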
Maybe we can introduce an IDoNotBuffer or IWithHighPriority interface marker. It can then be used to indicate the BatchBuffer to flush immediately.
I think we already have an interface like this in Akka.Remote but I think it's reserved for internal messages.
On top of that though - it might be tough to add support for this without making some significant changes to the underlying Transport base class. The transport itself has no concept of the message being sent over the wire (it's just an array of bytes by the time it reaches the transport), so we'd need to have some way of signaling this as part of the AssociationHandle.Write method.
@Aaronontheweb Maybe a ZeroWrite could trigger a buffer-flush.
I think we might need to just turn off batching by default - for busy systems it should work well. For systems that have relatively low traffic the transmission delay might be too disruptive.
On top of that though, I have reason to suspect that the DotNetty scheduler is an enormous CPU hog and isn't implemented very efficiently. We had to deal with some issues surrounding that not long after the initial v1.4 release.
With regards to the DotNetty Scheduler... especially scheduling the flush....
In DotNetty's SingleThreadEventExecutor, I personally find the polling code suspect. I could be reading this wrong but it almost looks like there could be cases where if we have the flush scheduled the scheduler will 'wait' for that flush if the queueing happens in the 'wrong' order.
I've put together a branch that instead uses System.Threading.Timer for the Flush task scheduling. It -looks- like it works whether we do the flush on _context.Executor.Execute or just do the flush directly. Gut says the .Execute is safer but IDK.
Unfortunately my system is very unpredictable on remote benchmarks, so it's hard to say whether there is a sure-fire difference. If someone has some thoughts (even abstract) on a good way to test this I can try to write up the actual code.
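For anyone following along, this is roughly the shape of a timer-driven flush. It is only a sketch and not the code from the branch: TimerFlushScheduler and the dispatch parameter are hypothetical names. The dispatch delegate stands in for the choice between running the flush directly on the timer thread and handing it to something like an executor.

```csharp
using System;
using System.Threading;

// Hypothetical helper: periodically triggers a flush using System.Threading.Timer.
public sealed class TimerFlushScheduler : IDisposable
{
    private readonly Timer _timer;
    private readonly Action _flush;
    private readonly Action<Action> _dispatch;
    private int _running; // 0 = idle, 1 = a tick is currently being handled

    public TimerFlushScheduler(Action flush, TimeSpan interval, Action<Action> dispatch = null)
    {
        _flush = flush ?? throw new ArgumentNullException(nameof(flush));
        // Default: run the flush directly on the timer's thread-pool thread.
        _dispatch = dispatch ?? (work => work());
        // Created last so the other fields are set before the first tick fires.
        _timer = new Timer(OnTick, null, interval, interval);
    }

    private void OnTick(object state)
    {
        // Skip this tick if the previous one is still being handled.
        // Note: if dispatch only queues the work elsewhere, this guard covers
        // the hand-off, not the flush itself.
        if (Interlocked.CompareExchange(ref _running, 1, 0) != 0) return;
        try { _dispatch(_flush); }
        finally { Interlocked.Exchange(ref _running, 0); }
    }

    public void Dispose() => _timer.Dispose();
}
```

Passing a dispatch delegate that forwards to an executor would correspond to the .Execute variant; leaving it out corresponds to flushing directly.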
Update on this - working on an auto-tuning algorithm for the batching system in DotNetty which should, appropriately, scale down the batch size.
We're also going to ditch DotNetty's STEE for scheduling - it's a performance hog. @to11mtm has done some great work on that and I'm going to submit a PR shortly that introduces it as a stand-alone change.
This should be resolved via https://github.com/akkadotnet/akka.net/pull/4685
Scenario:
We have a Windows service that pings other Windows services and IIS app pools (ca. 10 services) every 10 seconds using Akka.Remote. Ping is implemented as Ask.
Some of them may be dead for a while, etc. - it is fairly normal (for example one of the services is an IIS app pool, which will be activated only after user action). All services are on the same machine.
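To illustrate what "ping implemented as Ask" can look like, here is a minimal sketch. It is not our production code: RemotePing and PingAsync are made-up names, and it uses Akka's built-in Identify / ActorIdentity messages instead of a custom ping message. It assumes an Akka.NET version where a timed-out Ask fails with AskTimeoutException.

```csharp
using System;
using System.Threading.Tasks;
using Akka.Actor;

// Hypothetical helper for illustration only.
public static class RemotePing
{
    // Returns true if an actor at targetPath answered within the timeout.
    public static async Task<bool> PingAsync(ActorSystem system, string targetPath, TimeSpan timeout)
    {
        try
        {
            // Every actor replies to Identify with an ActorIdentity message.
            var reply = await system.ActorSelection(targetPath)
                .Ask<ActorIdentity>(new Identify(targetPath), timeout);

            // Subject is null when the system is reachable but no actor matches the path.
            return reply.Subject != null;
        }
        catch (Exception ex) when (ex is AskTimeoutException || ex is TaskCanceledException)
        {
            // No reply in time, e.g. the target service is not running.
            return false;
        }
    }
}
```

A service that is down simply shows up as a false result after the timeout, which is the "dead for a while" case described above.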
With Akka 1.3.8 it was working correctly.
On Akka 1.4.6 we started observing the following problem on our test virtual machines: 1) start VM 2) services wake up, ping is working correctly 3) after at least 10 hours, the ping stops working.
Error: We get the following exception when one of our services tries to talk to another, for example using WCF over HTTP:
Basically, the OS prevents the services from talking to each other at all.
We've never had such problems before Akka upgrade.
Another symptom is that the services stop being responsive at all. Our test VM is resource constrained (1 CPU @ 4 GHz, 4 GB RAM).
Debugging snapshot looks like this:
If we restart the service that sends out the pings, the situation goes back to normal.
We tried updating nugets to 1.4.10 to no avail.
Please advise on what the best course of action is. Should we downgrade to 1.3.X? Are there any settings in 1.4.X that we could test? We could change the time between pings (for example 10s -> 30s), but we can't be sure that it fixes the problem on prod servers with more cores but lower CPU frequency. Maybe it will just postpone its occurrence?
It seems that DotNetty could be at fault. I checked their repo and it seems dead (side note: looks like that could be a huge problem for Akka long term).