akkadotnet / akka.net

Canonical actor model implementation for .NET with local + distributed actors in C# and F#.
http://getakka.net

Quarantine After Apparent Self-Association Failure #6310

Open nullcheck opened 1 year ago

nullcheck commented 1 year ago

Version Information Version of Akka.NET? 1.4.28

Which Akka.NET Modules? Cluster, Cluster Sharding, Distributed Data, Persistence, Streams

Describe the bug Note that this might not be a bug but just an incorrect use of Akka on our end. A node of ours got quarantined after seemingly losing an association to itself.

From the log of that node (which was using port 49794):

[akka://my-cluster/system/endpointManager] Association to [akka.tcp://my-cluster@mycluster.corporate.net:49794] with UID [1642503559] is irrecoverably failed. Quarantining address.
System.TimeoutException: Delivery of system messages timed out and they were dropped

[akka://my-cluster/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fmy-cluster%40corporate.net%3A49794-10] Removing receive buffers for [akka.tcp://my-cluster@mycluster.corporate.net:49794]->[akka.tcp://my-cluster@mycluster.corporate.net:49794]

The node doesn't appear to have been especially busy beforehand (although there is that TimeoutException), and other nodes don't seem to have had issues with it before either.

To Reproduce Unfortunately, we currently have no way to reproduce this issue.

Expected behavior The association failure should not happen.

Actual behavior The association failure occurred.

Screenshots n/a

Environment Windows, .NET 6.0.9

Additional context As requested by @Aaronontheweb on Discord, the HOCON in effect has been attached:

If you tried to send a remote ActorSelection to yourself, what happens? I don't know if we have a test case for that that's easy to reproduce. At least, if you wouldn't mind filing an issue and showing us a sanitized HOCON configuration, that would be very helpful, and we can look at reproducing the error as well.

This is the HOCON of the failing node only; please let me know if you need the others as well.

Aaronontheweb commented 1 year ago

Thanks @nullcheck - we'll see if we can reproduce the "self-association" you reported