akkadotnet / Akka.Management

Akka.NET cluster management, bootstrapping, and more.

Cluster.Bootstrap causes network socket port exhaustion due to socket leak during cluster formation #2571

Open · Arkatufus opened 1 week ago

Arkatufus commented 1 week ago

It has been observed that Cluster.Bootstrap can cause network socket port exhaustion if it fails to form a cluster immediately, because the TCP protocol holds each closed socket's port open in the TIME_WAIT linger state.

This has been observed especially in conjunction with Akka.Discovery.Azure.

Arkatufus commented 1 week ago

Problem isolated to the Ceen.Httpd package. Socket usage was stable when Ceen was replaced with Kestrel.

PS D:\git\akkadotnet\Akka.Management\src\discovery\examples\SocketLeakTest> .\netstat.ps1
Total number of open connections with local or foreign address port in the range 15885-16000: 758
PS D:\git\akkadotnet\Akka.Management\src\discovery\examples\SocketLeakTest> .\netstat.ps1
Total number of open connections with local or foreign address port in the range 15885-16000: 760
PS D:\git\akkadotnet\Akka.Management\src\discovery\examples\SocketLeakTest> .\netstat.ps1
Total number of open connections with local or foreign address port in the range 15885-16000: 753
PS D:\git\akkadotnet\Akka.Management\src\discovery\examples\SocketLeakTest> .\netstat.ps1
Total number of open connections with local or foreign address port in the range 15885-16000: 741
PS D:\git\akkadotnet\Akka.Management\src\discovery\examples\SocketLeakTest> .\netstat.ps1
Total number of open connections with local or foreign address port in the range 15885-16000: 754
PS D:\git\akkadotnet\Akka.Management\src\discovery\examples\SocketLeakTest> .\netstat.ps1
Total number of open connections with local or foreign address port in the range 15885-16000: 760
PS D:\git\akkadotnet\Akka.Management\src\discovery\examples\SocketLeakTest> .\netstat.ps1
Total number of open connections with local or foreign address port in the range 15885-16000: 760
PS D:\git\akkadotnet\Akka.Management\src\discovery\examples\SocketLeakTest> .\netstat.ps1
Total number of open connections with local or foreign address port in the range 15885-16000: 749
PS D:\git\akkadotnet\Akka.Management\src\discovery\examples\SocketLeakTest> .\netstat.ps1
Total number of open connections with local or foreign address port in the range 15885-16000: 742
PS D:\git\akkadotnet\Akka.Management\src\discovery\examples\SocketLeakTest> .\netstat.ps1
Total number of open connections with local or foreign address port in the range 15885-16000: 752
PS D:\git\akkadotnet\Akka.Management\src\discovery\examples\SocketLeakTest> .\netstat.ps1
Total number of open connections with local or foreign address port in the range 15885-16000: 760
PS D:\git\akkadotnet\Akka.Management\src\discovery\examples\SocketLeakTest> .\netstat.ps1
Total number of open connections with local or foreign address port in the range 15885-16000: 760
PS D:\git\akkadotnet\Akka.Management\src\discovery\examples\SocketLeakTest> .\netstat.ps1
Total number of open connections with local or foreign address port in the range 15885-16000: 750
PS D:\git\akkadotnet\Akka.Management\src\discovery\examples\SocketLeakTest> .\netstat.ps1
Total number of open connections with local or foreign address port in the range 15885-16000: 742
PS D:\git\akkadotnet\Akka.Management\src\discovery\examples\SocketLeakTest> .\netstat.ps1
Total number of open connections with local or foreign address port in the range 15885-16000: 752
PS D:\git\akkadotnet\Akka.Management\src\discovery\examples\SocketLeakTest> .\netstat.ps1
Total number of open connections with local or foreign address port in the range 15885-16000: 760
Aaronontheweb commented 1 week ago

An update, since I've been investigating this using our test lab: we can reproduce this issue on Linux too, which makes me think the problem may be related to how aggressively we try to re-connect during cluster formation.

Cluster Fully Formed (20/20 nodes)

Only ~20 active TCP connections per node, which makes sense: most of these belong to Akka.Remote, an OTLP exporter, and maybe a few other components.

[image: lab-experiment-1-control]

Cluster Unable to Form (18/20 nodes)

About 1,100 active TCP connections per node. This looks like hyper-aggressive retries, not some kind of TCP handling issue.

[image: lab-experiment-1]

Aaronontheweb commented 1 week ago

Another piece of evidence in favor of the "aggressive retries" theory of the case: look at the step function in active TCP connections when cluster formation does occur:

[image]

The oldest nodes have significantly more open TCP connections than the newer nodes that Kubernetes started later in the deployment. This looks more like a "thundering herd" problem than a resource leak.

Aaronontheweb commented 6 days ago

We did some more work on this over the weekend and captured more data from more experiments - the problem is definitely caused by how frequently Akka.Management's cluster bootstrapper is HTTP-polling its peers:

TCP Connectivity Data

1s interval - ~1100 connections per node

[image]

5s interval - ~260-280 connections per node

[image]

10s interval - ~100-105 connections per node

[image]

The key setting at play here is akka.management.cluster.bootstrap.contact-point.probe-interval, which defaults to 1s. If we increase it to 5s, we see a much smaller number of concurrent TCP connections.
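For anyone wanting to try this mitigation, a minimal HOCON override (a sketch built from the key quoted above; this is not yet an official recommendation):

```hocon
# Slow contact-point probing from the 1s default down to 5s.
# Per the measurements above, this cut concurrent TCP connections
# from ~1100 to ~260-280 per node.
akka.management.cluster.bootstrap.contact-point {
  probe-interval = 5s
}
```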

Cluster Formation Times

akka.management.cluster.bootstrap.contact-point.probe-interval = 1s

Running a 22-node cluster using Akka.Discovery.KubernetesApi, we see the following end-to-end cluster formation times with akka.management.cluster.bootstrap.contact-point.probe-interval = 1s, the default. We also have a hard 20-nodes-must-be-up requirement configured for Cluster.Bootstrap, so cluster formation can't occur until the 20th node has come online.

[image]

It takes an average of about 30s for a cluster to fully form; this is mostly due to the amount of time it takes Kubernetes to spin up all of the pods. The oldest nodes in the cluster have a longer average and the youngest ones a shorter one, hence the time distribution you see here.
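For context, the 20-node floor mentioned above would typically be expressed like this (a sketch; I'm assuming required-contact-point-nr is the setting in use, since it's the standard Cluster.Bootstrap knob for a hard node-count requirement):

```hocon
# Assumed configuration for the hard 20-nodes-must-be-up requirement:
# bootstrap refuses to seed a new cluster until at least this many
# contact points have been discovered and probed.
akka.management.cluster.bootstrap.contact-point-discovery {
  required-contact-point-nr = 20
}
```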

akka.management.cluster.bootstrap.contact-point.probe-interval = 5s

Same exact environment / reproduction sample as before, just with the probing interval set to 5s:

[image]

The cluster never forms. This is apparently due to a bug in the logic that "times out" the freshness of a node's last healthy check-in: the timeout for that check is configured completely independently of the polling interval, and that independence is the bug.
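A hedged illustration of that interaction (key names and the 3s default are assumptions based on Akka.Management's reference settings, not confirmed in this thread):

```hocon
# Hypothesis: the freshness timeout does not scale with the probe
# interval. With the assumed 3s default below, a 5s probe interval
# means every contact point looks stale before it gets re-probed.
akka.management.cluster.bootstrap.contact-point {
  probe-interval = 5s            # how often each contact point is polled
  probing-failure-timeout = 3s   # assumed independent freshness timeout
}
```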

Next Steps

@Arkatufus already identified this issue and is preparing a fix for it. If that fix works, then the port exhaustion problems can be addressed by just increasing the probing interval. We are going to test this in our lab and confirm before making concrete recommendations to affected users. Just wanted to post an update to let everyone know that this is being urgently addressed.

Aaronontheweb commented 6 days ago

One other setting can alleviate a major stressor contributing to this port exhaustion problem:

https://github.com/akkadotnet/Akka.Management/blob/d7812ffbf371e8a3510bd296c9f7d14de050ef4e/src/management/Akka.Management/Resources/reference.conf#L151-L156

Set that to false and it will also significantly reduce the amount of TCP traffic; I'll put up some data supporting that in the next day or so as well. Changing this setting can, in theory, open the possibility of a split brain forming, but IMHO that should be quite rare in practice.
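The linked reference.conf lines aren't quoted in the thread, but assuming the setting in question is contact-point-discovery.contact-with-all-contact-points (an assumption inferred from the split-brain caveat above), the override would look like:

```hocon
# Assumption: the reference.conf setting linked above. When false,
# a node no longer probes every discovered contact point before
# joining, trading a theoretical split-brain risk for far less
# TCP traffic during bootstrap.
akka.management.cluster.bootstrap.contact-point-discovery {
  contact-with-all-contact-points = false
}
```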

Aaronontheweb commented 6 days ago

Related fix: https://github.com/akkadotnet/Akka.Management/pull/2589

Aaronontheweb commented 5 days ago

Should we just change the default polling interval to 5s? That should help put this issue to bed.