akkadotnet / akka.net

Canonical actor model implementation for .NET with local + distributed actors in C# and F#.
http://getakka.net

Port exhaustion problem with Akka.Cluster #2575

Closed Blind-Striker closed 5 years ago

Blind-Striker commented 7 years ago

Hi,

We created an Akka.Cluster infrastructure for SMS, email, and push notifications. Three different kinds of nodes exist in the system: client, sender, and lighthouse. The client role is used by our Web application and our API application (both hosted in IIS). The lighthouse and sender roles are hosted as Windows services, and we also run 4 more console instances of the same Windows service in the sender role.

We've been experiencing port exhaustion problems on our Web Server for about 2 weeks. The Web Server starts consuming ports quickly, and after a while we cannot perform any SQL operations; sometimes we have no choice but to do an IIS reset. The problem only occurs when there is more than one node in the sender role. We diagnosed it and found the source of the problem.

---------------
HOST                  OPEN    WAIT
SRV_NOTIFICATION      3429    0
SRV_LOCAL             198     0
SRV_UNDEFINED_IPV4    23      0
SRV_DATABASE          15      0
SRV_AUTH              4       0
SRV_API               6       0
SRV_UNDEFINED_IPV6    19      0
SRV_INBOUND           12347   5

TotalPortsInUse   : 17286
MaxUserPorts      : 64510
TcpTimedWaitDelay : 30
03/23/2017 09:30:10
---------------

SRV_NOTIFICATION is the server where the lighthouse and sender nodes run. SRV_INBOUND is our Web Server. After checking this table, we checked which ports were assigned on the Web Server and got the results below. In netstat there were more than 12,000 connections like this:

TCP    192.168.1.10:65531     192.168.1.10:3564      ESTABLISHED     5716   [w3wp.exe]
TCP    192.168.1.10:65532     192.168.1.101:17527    ESTABLISHED     5716   [w3wp.exe]
TCP    192.168.1.10:65533     192.168.1.101:17527    ESTABLISHED     5716   [w3wp.exe]
TCP    192.168.1.10:65534     192.168.1.10:3564      ESTABLISHED     5716   [w3wp.exe]

192.168.1.10         = Web Server
192.168.1.10:3564    = API
192.168.1.101:17527  = Lighthouse

The connections are opening but not closing.

After deployments, our Web and API applications leave and rejoin the cluster, and they are configured to use fixed ports. We're monitoring our cluster with the application created by @cgstevens. Even though we implemented graceful shutdown logic for the actor system (see the sketch below), sometimes the Web and API applications can't leave the cluster, so we have to remove the nodes manually and restart the actor system.
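
A simplified sketch of that graceful shutdown logic (illustrative, not our exact code):

using System;
using System.Threading.Tasks;
using Akka.Actor;
using Akka.Cluster;

// Ask the cluster to move this node through Leaving -> Exiting -> Removed,
// then terminate the ActorSystem once the node has actually left.
static async Task LeaveClusterGracefully(ActorSystem actorSystem)
{
    var cluster = Cluster.Get(actorSystem);
    var removed = new TaskCompletionSource<bool>();

    // Fires once the cluster has moved this node to Removed.
    cluster.RegisterOnMemberRemoved(() => removed.TrySetResult(true));
    cluster.Leave(cluster.SelfAddress);

    // Bound the wait; when the leave never completes (the failure mode we
    // sometimes hit), terminate the ActorSystem anyway.
    await Task.WhenAny(removed.Task, Task.Delay(TimeSpan.FromSeconds(10)));
    await actorSystem.Terminate();
}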

We have reproduced the problem in our development environment and recorded a video:

https://drive.google.com/file/d/0B5ZNfLACId3jMWUyOWliMUhNWTQ/view

Our HOCON configurations for the nodes are below:

WEB and API

<akka>
    <hocon><![CDATA[
            akka{
                loglevel = DEBUG

                actor{
                    provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"

                    deployment {
                        /coordinatorRouter {
                            router = round-robin-group
                            routees.paths = ["/user/NotificationCoordinator"]
                            cluster {
                                    enabled = on
                                    max-nr-of-instances-per-node = 1
                                    allow-local-routees = off
                                    use-role = sender
                            }
                        }

                        /decidingRouter {
                            router = round-robin-group
                            routees.paths = ["/user/NotificationDeciding"]
                            cluster {
                                    enabled = on
                                    max-nr-of-instances-per-node = 1
                                    allow-local-routees = off
                                    use-role = sender
                            }
                        }
                    }

                    serializers {
                            wire = "Akka.Serialization.HyperionSerializer, Akka.Serialization.Hyperion"
                    }

                    serialization-bindings {
                     "System.Object" = wire
                    }

                    debug{
                        receive = on
                        autoreceive = on
                        lifecycle = on
                        event-stream = on
                        unhandled = on
                    }
                }

                remote {
                    helios.tcp {
                            transport-class = "Akka.Remote.Transport.Helios.HeliosTcpTransport, Akka.Remote"
                            applied-adapters = []
                            transport-protocol = tcp
                            hostname = "192.168.1.10"
                            port = 3564
                    }
                }

                cluster {
                        seed-nodes = ["akka.tcp://notificationSystem@192.168.1.101:17527"]
                        roles = [client]
                }
            }
        ]]>
    </hocon>
</akka>

Lighthouse

<akka>
        <hocon>
            <![CDATA[
                    lighthouse{
                            actorsystem: "notificationSystem"
                        }

                    akka {
                        actor { 
                            provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"

                            serializers {
                                wire = "Akka.Serialization.HyperionSerializer, Akka.Serialization.Hyperion"
                            }

                            serialization-bindings {
                                "System.Object" = wire
                            }
                        }

                        remote {
                            log-remote-lifecycle-events = DEBUG
                            helios.tcp {
                                transport-class = "Akka.Remote.Transport.Helios.HeliosTcpTransport, Akka.Remote"
                                applied-adapters = []
                                transport-protocol = tcp
                                #will be populated with a dynamic host-name at runtime if left uncommented
                                #public-hostname = "192.168.1.100"
                                hostname = "192.168.1.101"
                                port = 17527
                            }
                        }            

                        loggers = ["Akka.Logger.NLog.NLogLogger,Akka.Logger.NLog"]

                        cluster {
                            seed-nodes = ["akka.tcp://notificationSystem@192.168.1.101:17527"]
                            roles = [lighthouse]
                        }
                    }
            ]]>
        </hocon>
    </akka>

Sender

<akka>
    <hocon><![CDATA[
                akka{
                    # stdout-loglevel = DEBUG
                    loglevel = DEBUG
                    # log-config-on-start = on

                    loggers = ["Akka.Logger.NLog.NLogLogger, Akka.Logger.NLog"]

                    actor{
                        debug {  
                            # receive = on 
                            # autoreceive = on
                            # lifecycle = on
                            # event-stream = on
                            # unhandled = on
                        }         

                        provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"           

                        serializers {
                            wire = "Akka.Serialization.HyperionSerializer, Akka.Serialization.Hyperion"
                        }

                        serialization-bindings {
                         "System.Object" = wire
                        }

                        deployment{                         
                            /NotificationCoordinator/LoggingCoordinator/DatabaseActor{
                                router = round-robin-pool
                                resizer{
                                    enabled = on
                                    lower-bound = 3
                                    upper-bound = 5
                                }
                            }                           

                            /NotificationDeciding/NotificationDecidingWorkerActor{
                                router = round-robin-pool
                                resizer{
                                    enabled = on
                                    lower-bound = 3
                                    upper-bound = 5
                                }
                            }

                            /ScheduledNotificationCoordinator/SendToProMaster/JobToProWorker{
                                router = round-robin-pool
                                resizer{
                                    enabled = on
                                    lower-bound = 3
                                    upper-bound = 5
                                }
                            }
                        }
                    }

                 remote{                            
                            log-remote-lifecycle-events = DEBUG
                            log-received-messages = on

                            helios.tcp{
                                transport-class = "Akka.Remote.Transport.Helios.HeliosTcpTransport, Akka.Remote"
                                applied-adapters = []
                                transport-protocol = tcp
                                #will be populated with a dynamic host-name at runtime if left uncommented
                                #public-hostname = "POPULATE STATIC IP HERE"
                                hostname = "192.168.1.101"
                                port = 0
                        }
                    }

                    cluster {
                        seed-nodes = ["akka.tcp://notificationSystem@192.168.1.101:17527"]
                        roles = [sender]
                    }
                }
            ]]></hocon>
  </akka>

Cluster.Monitor

    <akka>
        <hocon>
            <![CDATA[
                    akka {
                        stdout-loglevel = INFO
                        loglevel = INFO
                        log-config-on-start = off 

                        actor {
                            provider = "Akka.Remote.RemoteActorRefProvider, Akka.Remote"                

                            serializers {
                                wire = "Akka.Serialization.HyperionSerializer, Akka.Serialization.Hyperion"
                            }
                            serialization-bindings {
                                "System.Object" = wire
                            }

                            deployment {                                
                                /clustermanager {
                                    dispatcher = akka.actor.synchronized-dispatcher
                                }
                            }
                        }

                        remote {
                            log-remote-lifecycle-events = INFO
                            log-received-messages = off
                            log-sent-messages = off

                            helios.tcp {                                
                                transport-class = "Akka.Remote.Transport.Helios.HeliosTcpTransport, Akka.Remote"
                                applied-adapters = []
                                transport-protocol = tcp
                                #will be populated with a dynamic host-name at runtime if left uncommented
                                #public-hostname = "127.0.0.1"
                                hostname = "192.168.1.101"
                                port = 0
                            }
                        }            

                        cluster {                           
                        seed-nodes = ["akka.tcp://notificationSystem@192.168.1.101:17527"]
                            roles = [ClusterManager]

                            client {
                                initial-contacts = ["akka.tcp://notificationSystem@192.168.1.101:17527/system/receptionist"]
                            }
                        }
                    }
            ]]>
        </hocon>
    </akka>
Aaronontheweb commented 7 years ago

@Blind-Striker is this an issue with Lighthouse specifically or other Akka.NET services as well?

edit: Nevermind. You said it's the sender service.

This sounds like an issue with unclean shutdowns.

I'd strongly recommend using the latest nightly builds http://getakka.net/docs/akka-developers/nightly-builds until Akka.NET 1.2 is released. The new CoordinatedShutdown feature that comes built into Akka.NET is much, much better about enforcing shutdown rules than what we had before.
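
A minimal usage sketch (assuming the 1.2-era API surface; later releases added an overload that takes a shutdown reason):

// Run() executes every configured shutdown phase in order (cluster leave,
// cluster exiting, actor system termination) and the returned task
// completes when the whole sequence has finished.
await CoordinatedShutdown.Get(actorSystem).Run();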

Aaronontheweb commented 7 years ago

On top of that, the new DotNetty transport is going to have improved handling of this type of stuff over Helios.

Aaronontheweb commented 7 years ago

Removed my previous comment; that was an error, and it's not the cluster's job to do that. We're going to get to the bottom of this, but in the meantime use the nightlies and let us know on this issue if that improves things. 1.2 is due out imminently; it's blocked by a third-party NuGet release that is due out any day now.

Blind-Striker commented 7 years ago

Thanks @Aaronontheweb, we will do the necessary updates and let you know the results. In the meantime, should we open a question on Stack Overflow for other people experiencing this problem?

Aaronontheweb commented 7 years ago

@Blind-Striker yep, that would be great.

Aaronontheweb commented 7 years ago

Working theory on the issue: we aren't shutting down the associations a reachable node makes when it attempts to associate with an unreachable node (which happens repeatedly on a timer). This is ultimately an Akka.Remote issue. Going to investigate.

Blind-Striker commented 7 years ago

Stack Overflow link:

http://stackoverflow.com/questions/43128080/port-exhaustion-issue-with-akka-cluster

jalchr commented 7 years ago

I think I'm having the same issue. I have upgraded to v1.2. About 3 hours after starting a service, it tries to connect to unknown ports (ports not configured in the HOCON).

Here is an excerpt of my logs:

2017-04-18 03:46:53,582 [1] INFO Mayadeen.Archive.VideoExporter.ConfigureService - Setting up the service
2017-04-18 03:46:53,660 [1] INFO Topshelf.HostFactory - Configuration Result: [Success] Name ExportService [Success] DisplayName Almayadeen video export service [Success] Description Almaydeen archive system, video export service [Success] ServiceName ExportService
2017-04-18 03:46:53,660 [1] INFO Topshelf.HostConfigurators.HostConfiguratorImpl - Topshelf v4.0.0.0, .NET Framework v4.0.30319.42000
2017-04-18 03:46:53,676 [1] INFO Topshelf.Runtime.Windows.WindowsServiceHost - Starting as a Windows service
2017-04-18 03:46:53,676 [6] INFO Topshelf.Runtime.Windows.WindowsServiceHost - [Topshelf] Starting
2017-04-18 03:46:53,692 [6] INFO ProcessingService - Starting ... 'Export service'
2017-04-18 03:46:57,208 [11] INFO Akka.Event.DummyClassForStringSources - Starting remoting
2017-04-18 03:46:57,379 [8] INFO Akka.Event.DummyClassForStringSources - Remoting started; listening on addresses : [akka.tcp://ArchiveSystem@140.125.4.2:16667]
2017-04-18 03:46:57,379 [9] INFO Akka.Event.DummyClassForStringSources - Remoting now listens on addresses: [akka.tcp://ArchiveSystem@140.125.4.2:16667]
2017-04-18 03:46:57,426 [10] INFO Akka.Event.DummyClassForStringSources - Cluster Node [akka.tcp://ArchiveSystem@140.125.4.2:16667] - Starting up...
2017-04-18 03:46:57,442 [8] INFO Akka.Event.DummyClassForStringSources - Cluster Node [akka.tcp://ArchiveSystem@140.125.4.2:16667] - Started up successfully
2017-04-18 03:46:57,458 [6] INFO ProcessingService - Started - 'Export service'
2017-04-18 03:46:57,458 [6] INFO Topshelf.Runtime.Windows.WindowsServiceHost - [Topshelf] Started
2017-04-18 03:46:57,458 [26] INFO Mayadeen.Archive.VideoExporter.Akka.ApiMaster - [PreStart] ApiMaster
2017-04-18 03:46:57,458 [26] INFO Mayadeen.Archive.VideoExporter.Akka.VideoMaster - [PreStart] VideoMaster
2017-04-18 03:46:57,661 [1] INFO Topshelf.Hosts.StartHost - The ExportService service was started.
2017-04-18 03:46:58,192 [24] INFO Akka.Cluster.ClusterCoreDaemon - Welcome from [akka.tcp://ArchiveSystem@140.125.4.5:4053]
2017-04-18 03:47:46,257 [30] INFO Akka.Cluster.ClusterCoreDaemon - Leader is moving node [akka.tcp://ArchiveSystem@140.125.4.2:16668] to [Up]
2017-04-18 03:51:08,332 [26] INFO Akka.Cluster.ClusterCoreDaemon - Leader is moving node [akka.tcp://ArchiveSystem@140.125.4.1:16668] to [Up]

2017-04-18 06:50:33,183 [33] WARN Akka.Event.DummyClassForStringSources - Association to [akka.tcp://ArchiveSystem@140.125.4.1:16668] having UID [1471622119] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.

2017-04-18 06:50:40,293 [10] WARN Akka.Remote.EndpointWriter - AssociationError [akka.tcp://ArchiveSystem@140.125.4.2:16667] -> akka.tcp://ArchiveSystem@140.125.4.1:41762: Error [No connection could be made because the target machine actively refused it tcp://ArchiveSystem@140.125.4.1:41762] []

2017-04-18 06:50:40,293 [26] WARN Akka.Event.DummyClassForStringSources - Tried to associate with unreachable remote address [akka.tcp://ArchiveSystem@140.125.4.1:41762]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [No connection could be made because the target machine actively refused it tcp://ArchiveSystem@140.125.4.1:41762]

...

Aaronontheweb commented 7 years ago

@jalchr unknown ports? Are you using port 0 somewhere?

Aaronontheweb commented 7 years ago

@Blind-Striker I downgraded this issue from confirmed bug to potential bug because I haven't been able to reproduce it at all locally using 1.2.0. I monitored the ports closely using SysInternals and chaos-monkey'd the cluster in several different ways. I saw all of the ports open and close as expected.

I also followed your video closely, but even there it doesn't look like the ports are staying open.

PM me on Gitter and let's take a look at this over Skype or email. I need more data on how to reproduce it before I can investigate further.

Aaronontheweb commented 7 years ago

Looking at this more closely, I think we may have an issue with DotNetty.

Related: https://github.com/Azure/DotNetty/issues/238

Working on some specs now to verify that it's not an Akka.Remote issue.

Blind-Striker commented 7 years ago

Hi @Aaronontheweb

Since we reduced the number of IIS applications on our Web Server, we have not encountered this problem again. But in tests in our local environment, we get the same results when we apply the method from the video I sent above. I will write a sample application and send it to you to reproduce this case when I have time.

I must note that we are still using Akka 1.1.3. We will update Akka to version 1.3.1 this week or next, and I'll let you know the results.

Maybe this problem is no longer valid; I don't want to take up your time on this case.

ZigMeowNyan commented 6 years ago

We had an installation report a similar issue last week. It's the first and only report like it, but it's also a VM in a cloud environment with a lot more unrelated network activity going on. We're also using Akka 1.1.3. It's relevant to note that this version of Akka still uses Helios, not DotNetty, so the DotNetty issue referenced above isn't directly related. I don't think Helios is a supported scenario anymore, so I'm not really looking for an investigation, but it seems worthwhile to report the additional information.

This installation isn't using clustering at all. It's a collection of five services that communicate with each other directly via Akka.NET remoting. The service that was causing problems is just an inbound connector. On startup, the actor system is created and 3 ActorSelection instances for the other services are created. Beyond that, it opens a listening TCP port, waits for incoming connections, handles those incoming messages, and routes them to the other services via Akka. The service somehow had over 30,000 ports open when the local admin terminated it. There weren't incoming connections at the time, but I can't say it hadn't been used at all, since the service had been running for multiple days.

Port was set to 0 in the HOCON config (under helios.tcp), mostly under the assumption that this service associates with the others and none of them will initiate a connection to it, so it should theoretically be fine for it to use a dynamic port.

Some of the connector code does use the Ask&lt;T&gt; extension, which dynamically allocates actors and is most definitely not a pure actor design (see the sketch below). I'll probably spend some time investigating the lifetime of ports opened via the future actors allocated that way, and whether port 0 affects that. And maybe checking whether any heartbeat traffic comes into play.
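
For context, the pattern in question looks roughly like this (the address and StatusRequest message type are made up):

using System;
using Akka.Actor;

// Ask<T> creates a temporary "future" actor under /temp to receive the
// reply; the message itself travels over the existing remote association
// rather than opening a new socket per call.
var selection = system.ActorSelection(
    "akka.tcp://SomeService@192.168.1.50:8081/user/handler");

// A timeout bounds the lifetime of the temporary actor if the remote
// side never replies.
var reply = await selection.Ask<string>(new StatusRequest(), TimeSpan.FromSeconds(5));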

Aaronontheweb commented 6 years ago

BTW, some updates on this.

Had some reports from users running versions of Akka.Cluster as recent as 1.3.7 that port exhaustion was still an ongoing issue, including one this week. The Akka.NET v1.3.8 release includes an upgrade to DotNetty v0.4.8, which has fixes for cleaning up sockets upon channel shutdown. Reports I've had from end users affected by this issue indicate that the Akka.NET v1.3.8 / DotNetty v0.4.8 upgrades effectively address it.

I'm working on a methodology to prove this definitively, but I'd strongly recommend that you upgrade anyway since we appear to have eliminated at least some potential causes of this in the most recent release.

Ulimo commented 6 years ago

Hi, I'm running Akka.Cluster 1.3.8, and I just experienced port exhaustion. I received this event: "A request to allocate an ephemeral port number from the global TCP port space has failed due to all such ports being in use."

Followed by: TCP/IP failed to establish an outgoing connection because the selected local endpoint was recently used to connect to the same remote endpoint. This error typically occurs when outgoing connections are opened and closed at a high rate, causing all available local ports to be used and forcing TCP/IP to reuse a local port for an outgoing connection. To minimize the risk of data corruption, the TCP/IP standard requires a minimum time period to elapse between successive connections from a given local endpoint to a given remote endpoint.

Unfortunately I did not run netstat, but I saw in the event log that other services on the computer could not create outgoing TCP connections either. When I shut down the Akka services, the problem resolved. So it still seems to be an ongoing issue.

Aaronontheweb commented 6 years ago

Thanks @Ulimo - I had shelved an idea for definitively testing whether or not this is still an issue because I thought our most recent DotNetty upgrade had resolved it, but given your bug report it appears that is not the case. We'll begin looking back into it.

Ulimo commented 6 years ago

@Aaronontheweb FYI, I added a script to both our test and production environments to monitor open TCP ports with timestamps, so I will be able to correlate them with the logs. If this happens again, is there any other information I should collect that would simplify finding the leak?
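
In case it's useful to others, a rough C# sketch of that kind of monitor built on the standard System.Net.NetworkInformation APIs (our actual script differs):

using System;
using System.Linq;
using System.Net.NetworkInformation;
using System.Threading;

class PortMonitor
{
    static void Main()
    {
        while (true)
        {
            // Snapshot of every active TCP connection on the machine.
            var connections = IPGlobalProperties.GetIPGlobalProperties()
                                                .GetActiveTcpConnections();

            Console.WriteLine($"{DateTime.UtcNow:o} total={connections.Length}");

            // Group by remote address so a leak toward one peer stands out.
            foreach (var g in connections.GroupBy(c => c.RemoteEndPoint.Address)
                                         .OrderByDescending(g => g.Count())
                                         .Take(5))
            {
                Console.WriteLine($"  {g.Key} open={g.Count()}");
            }

            Thread.Sleep(TimeSpan.FromSeconds(30));
        }
    }
}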

Aaronontheweb commented 5 years ago

Update on this, since @izavala and I have been working on it extensively for a few weeks.

We were able to create a reliable reproduction case for this here: https://github.com/petabridge/SocketLeakDetection - along with a set of actors that can automatically kill the affected ActorSystem when the port count grows too rapidly or exceeds a configurable absolute threshold. You can read the README on that repository for a description of how the tool and its underlying math work.

More importantly, however, we have a fix for the issue here: https://github.com/akkadotnet/akka.net/pull/3764 - I'm working on cleaning it up and getting it ready for merge into 1.3.13, which we hope to release immediately. This was a very difficult bug to track down and reproduce, but thanks to the efforts and very detailed logs provided by affected end users, we were finally able to identify the underlying cause: under some circumstances we never properly disassociated outbound associations that failed during the Akka.Remote handshake process. You can see my comment on the fix (ported from the JVM) here: https://github.com/akkadotnet/akka.net/pull/3764#discussion_r279027338

This issue happened most noticeably when a node inside Akka.Cluster was quarantined by at least one other node. That caused a perpetual loop: Akka.Cluster kept retrying the failed connection from the quarantined node back to the quarantiner, and each attempt was rejected because of the Akka.Remote quarantine in effect. Since we never properly closed the outbound association, the inbound socket may not have been properly closed either, which is why the port count could increase simultaneously on both nodes.
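
Until you can upgrade, one stopgap is to detect the quarantine yourself and recycle the affected process, which also breaks the reconnect loop described above. A sketch, assuming the ThisActorSystemQuarantinedEvent that Akka.Remote publishes to the EventStream (the restart policy here is illustrative):

using Akka.Actor;
using Akka.Remote;

// Terminates the ActorSystem when a remote system quarantines us; the only
// way out of a quarantine is a full ActorSystem restart, so let the host
// process (e.g. a Windows service wrapper) bring it back up with a new UID.
public class QuarantineWatcher : ReceiveActor
{
    public QuarantineWatcher()
    {
        Context.System.EventStream.Subscribe(Self, typeof(ThisActorSystemQuarantinedEvent));

        Receive<ThisActorSystemQuarantinedEvent>(e =>
        {
            Context.System.Terminate();
        });
    }
}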

Please leave a comment here if you have any additional questions or insights about this - one of the nastier bugs we've ever had to track down.

Aaronontheweb commented 5 years ago

closed via #3764