akkadotnet / akka.net

Canonical actor model implementation for .NET with local + distributed actors in C# and F#.
http://getakka.net

How to properly shutdown remote cluster node #2860

Closed gengle closed 7 years ago

gengle commented 7 years ago

Background

I have 3 nodes, with the roles [lighthouse], [api], and [client].

The client node is a command line utility which uses a Clustered Routing Group against the api node. Once the command line utility completes, it terminates itself.

Issue

Immediately after the [client] node terminates, I receive the following errors on both the [lighthouse] and [api] nodes.

System.Net.Sockets.SocketException (0x80004005): An existing connection was forcibly closed by the remote host
   at DotNetty.Transport.Channels.Sockets.SocketChannelAsyncOperation.Validate()
   at DotNetty.Transport.Channels.Sockets.AbstractSocketByteChannel.SocketByteChannelUnsafe.FinishRead(SocketChannelAsyncOperation operation)
[12:45:30 WRN] Association with remote system akka.tcp://scheduler@127.0.0.1:2303 has failed; address is now gated for 5000 ms. Reason is: [Akka.Remote.EndpointDisassociatedException: Disassociated
   at Akka.Remote.EndpointWriter.PublishAndThrow......

Question

How do I prevent these errors from happening?

I've tried the following with little success.

Explicitly calling Leave, then terminating the node:

var cluster = Cluster.Get(system);
cluster.Leave(cluster.SelfAddress);
system.Terminate();

Using a ManualResetEvent and waiting for the RegisterOnMemberRemoved callback:

var terminator = new ManualResetEvent(false);
var cluster = Cluster.Get(system);
cluster.RegisterOnMemberRemoved(()=> {
    system.Terminate();
    terminator.Set();
});
cluster.Leave(cluster.SelfAddress);
terminator.WaitOne();

What am I missing?

Aaronontheweb commented 7 years ago

Duplicate of: https://github.com/akkadotnet/akka.net/issues/2754

TL;DR: you're doing the right thing, but we're not handling this error message from DotNetty, which gets thrown even if the shutdown is clean. We're going to need to prune it out of the logs. We had this issue with Helios, the previous transport, too. It's an underlying issue with having to abort the async socket while it's waiting on an incoming receive and not handling that properly.

gengle commented 7 years ago

Aaron - is there a commit I could reference to understand how the Helios bug was handled? I can take a stab at resolving this, as I can't have the production logs polluted with it.

Aaronontheweb commented 7 years ago

Sure thing @gengle - here's the PR that @maxcherednik submitted to resolve this issue with Helios back in January, right before our 1.1.3 patch: https://github.com/akkadotnet/akka.net/pull/2453

Aaronontheweb commented 7 years ago

You can send the PR to the dev or the v1.3 branch. If you send it into dev, we'll treat this as a bugfix for the 1.2.* branch of Akka.NET and be able to release a small patch. If it goes into v1.3, we might have to take a little bit longer since that's part of a much larger planned release.

gengle commented 7 years ago

Thanks @Aaronontheweb - after a preliminary review, it looks like we need to throw InvalidAssociationException for this System.Net.Sockets.SocketException instance. I'll take a stab at this over the weekend. I'll target dev so we can get this released sooner than v1.3 (netstandard support; super stoked!)

I'll use #2754 moving forward

Aaronontheweb commented 7 years ago

Ok, I'll close this and we can use #2754 as the issue going forward. Thanks for agreeing to help with this. Much appreciated! It might be as simple as checking the exception and not propagating it, but the test suite will help guide you on that.
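
To make "checking the exception and not propagating it" concrete, here is a minimal sketch using only the .NET base class library. It is not the actual Akka.NET or DotNetty fix: `ShutdownNoiseFilter` and `IsPeerReset` are hypothetical names invented for illustration, and the sketch assumes the "forcibly closed by the remote host" error surfaces as a `SocketException` whose `SocketErrorCode` is `SocketError.ConnectionReset`, possibly wrapped in other exceptions.

```csharp
using System;
using System.Net.Sockets;

// Hypothetical helper, not part of Akka.NET or DotNetty.
public static class ShutdownNoiseFilter
{
    // Returns true when the exception, or any exception in its
    // InnerException chain, is a SocketException reporting that
    // the remote peer reset the connection.
    public static bool IsPeerReset(Exception ex)
    {
        for (var current = ex; current != null; current = current.InnerException)
        {
            if (current is SocketException se &&
                se.SocketErrorCode == SocketError.ConnectionReset)
            {
                return true;
            }
        }

        return false;
    }
}
```

A transport's exception handler could call `ShutdownNoiseFilter.IsPeerReset(ex)` and, when it returns true, log at debug level and close the channel instead of reporting an error. Whether that is appropriate inside the real transport depends on the existing test suite, as noted above.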