akkadotnet / akka.net

Canonical actor model implementation for .NET with local + distributed actors in C# and F#.
http://getakka.net

Akka.Remote - exhaustion of TCP buffer after updating from 1.3.8 to 1.4.6 #4563

Closed: voltcode closed this issue 3 years ago

voltcode commented 4 years ago

Scenario:

We have a Windows service that pings other Windows services and IIS app pools (ca. 10 services) every 10 seconds using Akka.Remote. The ping is implemented as an Ask.

Some of them may be dead for a while, which is fairly normal (for example, one of the services is an IIS app pool that is only activated after a user action). All services are on the same machine.
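
For illustration, roughly what "ping implemented as Ask" looks like here (the actor path, message types, and timeout are placeholders, not the actual service code):

```csharp
using System;
using System.Threading.Tasks;
using Akka.Actor;

public sealed class Ping { }
public sealed class Pong { }

public static class HealthCheck
{
    // Ask the remote service's health actor and treat a timeout as "not responding".
    public static async Task<bool> PingAsync(ActorSystem system, string remotePath)
    {
        // e.g. "akka.tcp://otherService@localhost:8091/user/health" (placeholder address)
        var target = system.ActorSelection(remotePath);
        try
        {
            await target.Ask<Pong>(new Ping(), TimeSpan.FromSeconds(5));
            return true;
        }
        catch (AskTimeoutException)
        {
            return false;
        }
    }
}
```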

With Akka 1.3.8 it was working correctly.

On Akka 1.4.6 we started observing the following problem on our test virtual machines: 1) start the VM, 2) the services wake up and the ping works correctly, 3) after at least 10 hours, the ping stops working.

Error: we get the following exception when one of our services tries to talk to another, for example over WCF using HTTP:

```
Unable to connect to the remote server ---> System.Net.Sockets.SocketException: An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full 127.0.0.1:80
   at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
   at System.Net.ServicePoint.ConnectSocketInternal(Boolean connectFailure, Socket s4, Socket s6, Socket& socket, IPAddress& address, ConnectSocketState state, IAsyncResult asyncResult, Exception& exception)
```

Basically, the OS prevents the services from talking to each other at all.

We've never had such problems before Akka upgrade.

Another symptom is that the services stop being responsive at all. Our test VM is resource-constrained (1 CPU @ 4 GHz, 4 GB RAM).

A debugging snapshot looks like this:

```
mscorlib_ni!System.Threading.Tasks.Task.InnerInvoke()
mscorlib_ni!System.Threading.Tasks.Task.Execute()
mscorlib_ni!System.Threading.Tasks.Task.ExecutionContextCallback(System.Object)
mscorlib_ni!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
mscorlib_ni!System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef)
mscorlib_ni!System.Threading.Tasks.Task.ExecuteEntry(Boolean)
mscorlib_ni!System.Threading.Tasks.TaskScheduler.TryExecuteTask(System.Threading.Tasks.Task)
mscorlib_ni!System.Threading.Tasks.Task.ScheduleAndStart(Boolean)
mscorlib_ni!System.Threading.Tasks.Task.InternalStartNew(System.Threading.Tasks.Task, System.Delegate, System.Object, System.Threading.CancellationToken, System.Threading.Tasks.TaskScheduler, System.Threading.Tasks.TaskCreationOptions, System.Threading.Tasks.InternalTaskOptions, System.Threading.StackCrawlMark ByRef)
mscorlib_ni!System.Threading.Tasks.TaskFactory.StartNew(System.Action, System.Threading.CancellationToken, System.Threading.Tasks.TaskCreationOptions, System.Threading.Tasks.TaskScheduler)
mscorlib_ni!System.Threading.Tasks.ThreadPoolTaskScheduler.LongRunningThreadWork(System.Object)
mscorlib_ni!System.Threading.ThreadHelper.ThreadStart_Context(System.Object)
mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
mscorlib_ni!System.Threading.ThreadHelper.ThreadStart(System.Object)
DotNetty.Common.dll!Unknown
[[HelperMethodFrame]
Akka.Remote.dll!Unknown
mscorlib_ni!System.Collections.Concurrent.ConcurrentQueue`1
mscorlib_ni!System.Collections.Concurrent.ConcurrentQueue`1[[System.__Canon, mscorlib]].Enqueue(System.__Canon)
DotNetty.Common.Concurrency.AbstractScheduledEventExecutor.Schedule(DotNetty.Common.Concurrency.IRunnable, System.TimeSpan)
Akka.Remote.Transport.DotNetty.BatchWriter
DotNetty.Common.Concurrency.RunnableScheduledTask.Execute()
DotNetty.Common.Concurrency.ScheduledTask.Run()
DotNetty.Common.Concurrency.AbstractEventExecutor.SafeExecute(DotNetty.Common.Concurrency.IRunnable)
DotNetty.Common.Concurrency.SingleThreadEventExecutor.RunAllTasks(DotNetty.Common.PreciseTimeSpan)
DotNetty.Common.Concurrency.SingleThreadEventExecutor.b__26_0()
DotNetty.Common.Concurrency.ExecutorTaskScheduler.QueueTask(System.Threading.Tasks.Task)
DotNetty.Common.Concurrency.SingleThreadEventExecutor.Loop()
DotNetty.Common.Concurrency.XThread
```

If we restart the service that sends out the pings, the situation goes back to normal.

We tried updating the NuGet packages to 1.4.10, to no avail.

Please advise on the best course of action. Should we downgrade to 1.3.x? Are there any settings in 1.4.x that we could test? We could increase the time between pings (for example 10s -> 30s), but we can't be sure that would fix the problem on production servers with more cores but a lower CPU frequency. Maybe it would just postpone the occurrence?

It seems that DotNetty could be at fault; I checked their repo and it seems dead (side note: looks like that could be a huge problem for Akka long term).

ismaelhamed commented 4 years ago

Akka 1.4.1 introduced a new feature in Akka.Remote called "batch writer". Not sure if it might be related, but you can read more here.

Try setting akka.remote.dot-netty.tcp.batching.enabled to false. This will revert the way Akka.Remote behaves to how it did before (1.3.x).
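
For example, a minimal sketch of applying that override when constructing the ActorSystem (the system name is a placeholder; fall back to whatever configuration the service already loads):

```csharp
using Akka.Actor;
using Akka.Configuration;

// Disable the Akka.Remote batch writer, reverting to pre-1.4 write behavior.
var batchingOff = ConfigurationFactory.ParseString(
    "akka.remote.dot-netty.tcp.batching.enabled = false");

// Layer the override on top of the configuration the service already loads
// (App.config / HOCON file fallbacks).
var system = ActorSystem.Create("pinger", batchingOff.WithFallback(ConfigurationFactory.Load()));
```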

voltcode commented 4 years ago

Thanks @ismaelhamed. Your suggestion seems to have worked. Another thing we tried was sending out the pings every 30 seconds instead; that also helped.

It seems to me that there's some bug in here though, triggered by frequent communication within short periods of time. It's hard to tell when batching is a good idea and when it is not. Moreover, I'd like to know whether Akka.Cluster's internal gossip could also be affected by batching, with similar results.

I'll leave it to the repo owners to review whether this ticket should be closed or not.

ismaelhamed commented 4 years ago

cc @Aaronontheweb

Aaronontheweb commented 4 years ago

@voltcode the batching code doesn't work well with systems that aren't running under continuously high load. The issue in this case might be that the size of the batch exceeded the size of the outbound buffer. I'll need to look more into the error message to see what its root causes are.
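
Assuming the batching settings introduced with the 1.4 batch writer (double-check the key names and defaults against the reference.conf shipped with your Akka.Remote version), a sketch of tuning them down so a quiet system flushes sooner, rather than disabling batching outright:

```csharp
using Akka.Configuration;

// Illustrative values only, not recommended defaults; a smaller batch size and
// a shorter flush interval make a low-traffic system flush sooner.
var batchingTuning = ConfigurationFactory.ParseString(@"
    akka.remote.dot-netty.tcp.batching {
        enabled = true
        max-pending-writes = 10   # flush after this many queued writes
        max-pending-bytes = 16k   # ... or after this many bytes are queued
        flush-interval = 10ms     # ... or after this much time has elapsed
    }");
```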

A reasonable fix, and probably what we should have done as part of the 1.4.0 release, is to make the batching system opt-in rather than opt-out. We found some other issues with it on "less busy" systems, such as a significant increase in idle CPU consumption from using the DotNetty scheduler behind the scenes.

voltcode commented 4 years ago

@Aaronontheweb thanks. I'm more worried about the fact that DotNetty could be at fault and it hasn't been updated in ages. That project seems dead, which could be a problem for Akka.NET in the future.

ismaelhamed commented 4 years ago

@Aaronontheweb regarding the feasibility of DotNetty in the long run, I guess we're betting on #4007 ?

Aaronontheweb commented 4 years ago

DotNetty vs. Artery:

  1. We're probably still going to have to support the DotNetty transport in some way, shape, or form even after Artery ships, since the Artery protocol is an entirely new animal and not wire-compatible. Starting with the 1.5 release, users will be able to choose between the two, so we'll want to make sure both options are viable.
  2. Artery should definitely be faster, since some of the DotNetty transport's problems are architectural and can't be solved through clever data structures or more performant APIs (the same issues as classic remoting in Akka).
  3. We really ought to replace DotNetty with something else: either a fork, Project Bedrock, something custom, etc. We have some real bit rot in the way things stand right now, and that needs to be addressed even if the ultimate plan is to replace the current remoting system with Artery in its entirety.

Zetanova commented 4 years ago

@Aaronontheweb Maybe we could introduce an IDoNotBuffer and/or IWithHighPriority marker interface. It could then be used to tell the BatchBuffer to flush immediately.

The name IWithHighPriority would create some conflicts with message reordering. Maybe the BatchBuffer could be extended with a priority feature, but being able to trigger a buffer flush with the last no-op message in a message sequence would be nice to have.
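
A rough sketch of what the proposed marker could look like (hypothetical, not part of Akka.NET; the Ping type is just an example of a latency-sensitive message that would opt out of batching):

```csharp
// Hypothetical marker interface (not part of Akka.NET): messages implementing it
// would signal the batch writer to flush the outbound buffer immediately.
public interface IDoNotBuffer { }

// Example: a health-check message that should never sit in a half-full batch.
public sealed class Ping : IDoNotBuffer
{
    public Ping(long timestampTicks) => TimestampTicks = timestampTicks;
    public long TimestampTicks { get; }
}
```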

Aaronontheweb commented 4 years ago

> Maybe we can introduce an IDoNotBuffer or IWithHighPriority interface marker. It can then be used to indicate the BatchBuffer to flush immediately.

I think we already have an interface like this in Akka.Remote but I think it's reserved for internal messages.

On top of that, though, it might be tough to add support for this without making some significant changes to the underlying Transport base class. Since the transport itself has no concept of the message being sent over the wire (it's just an array of bytes by the time it reaches the transport), we'd need some way of signaling this as part of the AssociationHandle.Write method.
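
For context, a simplified view of the transport-level write surface this refers to (paraphrased from Akka.Remote.Transport; the second overload in the comment is purely hypothetical):

```csharp
using Google.Protobuf;

// Simplified from Akka.Remote.Transport: by the time a message reaches the
// transport it is only a serialized ByteString, so any "flush now" hint would
// have to be threaded through this call.
public abstract class AssociationHandle
{
    public abstract bool Write(ByteString payload);

    // Hypothetical extension discussed above (not part of Akka.NET):
    // public abstract bool Write(ByteString payload, bool flushImmediately);
}
```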

Zetanova commented 4 years ago

@Aaronontheweb Maybe a ZeroWrite could trigger a buffer-flush.

Aaronontheweb commented 4 years ago

I think we might need to just turn off batching by default. For busy systems it should work well; for systems that have relatively low traffic, the transmission delay might be too disruptive.

On top of that, though, I have reason to suspect that the DotNetty scheduler is an enormous CPU hog and isn't implemented very efficiently. We had to deal with some issues surrounding that not long after the initial v1.4 release.

to11mtm commented 4 years ago

With regards to the DotNetty Scheduler... especially scheduling the flush....

In DotNetty's SingleThreadEventExecutor, I personally find the polling code suspect. I could be reading this wrong, but it almost looks like there could be cases where, if we have the flush scheduled, the scheduler will 'wait' for that flush if the queueing happens in the 'wrong' order.

I've put together a branch that instead uses System.Threading.Timer for scheduling the flush task. It looks like it works whether we do the flush via _context.Executor.Execute or just do the flush directly. Gut says .Execute is safer, but IDK.
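
Roughly what that looks like (a hypothetical sketch, not the actual branch; the handler name and the interval are made up for illustration):

```csharp
using System;
using System.Threading;
using DotNetty.Transport.Channels;

// Hypothetical sketch: drive the periodic flush from a System.Threading.Timer
// instead of the DotNetty scheduler, and marshal the flush back onto the
// channel's event loop via the executor.
class TimerFlushHandler : ChannelHandlerAdapter
{
    private static readonly TimeSpan FlushInterval = TimeSpan.FromMilliseconds(40); // illustrative interval
    private Timer _flushTimer;

    public override void HandlerAdded(IChannelHandlerContext context)
    {
        // The timer callback runs on a thread-pool thread, so hand the actual
        // flush to the channel's executor rather than flushing directly.
        _flushTimer = new Timer(
            _ => context.Executor.Execute(() => context.Flush()),
            null, FlushInterval, FlushInterval);
    }

    public override void HandlerRemoved(IChannelHandlerContext context)
    {
        _flushTimer?.Dispose();
    }
}
```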

Unfortunately my system is very unpredictable on remote benchmarks, so it's hard to say whether there is a sure-fire difference. If someone has some thoughts (even abstract) on a good way to test this I can try to write up the actual code.

Aaronontheweb commented 3 years ago

Update on this: I'm working on an auto-tuning algorithm for the batching system in DotNetty, which should scale the batch size down appropriately.

We're also going to ditch DotNetty's STEE (SingleThreadEventExecutor) for scheduling - it's a performance hog. @to11mtm has done some great work on that, and I'm going to submit a PR shortly that introduces it as a stand-alone change.

Aaronontheweb commented 3 years ago

This should be resolved via https://github.com/akkadotnet/akka.net/pull/4685