cloudfoundry / bosh-agent

BOSH Agent runs on each BOSH deployed VM
Apache License 2.0

connection flooding from bosh-agents whilst director is unavailable #289

Closed: rnwgnr closed this issue 2 years ago

rnwgnr commented 2 years ago

observed behaviour

When a director is down for maintenance, after a while (about 5 minutes in our environment) every VM with a bosh agent starts to flood the director with connection requests to port 4222. We see up to several hundred connection attempts per second from every VM managed by BOSH. After several minutes the packet storm calms down, only to start again after a short period of normal operation. The graph shows the number of connection attempts per second by a single VM:

[graph: connection attempts per second from a single VM, spiking to several hundred]

This causes severe load on the network components, depending on how large your bosh environment is.

The bosh agent log shows the following section:

2022-09-07_12:27:56.70086 [NATS Handler] 2022/09/07 12:27:56 DEBUG - Increased reconnect to: 2s
2022-09-07_12:28:00.70185 [NATS Handler] 2022/09/07 12:28:00 DEBUG - Increased reconnect to: 4s
2022-09-07_12:28:06.70327 [NATS Handler] 2022/09/07 12:28:06 DEBUG - Increased reconnect to: 8s
2022-09-07_12:28:16.70370 [NATS Handler] 2022/09/07 12:28:16 DEBUG - Increased reconnect to: 10s
2022-09-07_12:28:28.70543 [NATS Handler] 2022/09/07 12:28:28 DEBUG - Increased reconnect to: 10s
2022-09-07_12:28:40.71067 [NATS Handler] 2022/09/07 12:28:40 DEBUG - Increased reconnect to: 10s
2022-09-07_12:28:52.71286 [NATS Handler] 2022/09/07 12:28:52 DEBUG - Increased reconnect to: 10s
2022-09-07_12:29:02.71468 [NATS Handler] 2022/09/07 12:29:02 DEBUG - Increased reconnect to: 10s
2022-09-07_12:29:12.71673 [NATS Handler] 2022/09/07 12:29:12 DEBUG - Increased reconnect to: 10s
2022-09-07_12:29:22.71787 [NATS Handler] 2022/09/07 12:29:22 DEBUG - Increased reconnect to: 10s
2022-09-07_12:29:32.72396 [NATS Handler] 2022/09/07 12:29:32 DEBUG - Increased reconnect to: 10s
2022-09-07_12:29:42.72867 [NATS Handler] 2022/09/07 12:29:42 DEBUG - Increased reconnect to: 10s
2022-09-07_12:29:52.73361 [NATS Handler] 2022/09/07 12:29:52 DEBUG - Increased reconnect to: 10s
2022-09-07_12:30:02.74002 [NATS Handler] 2022/09/07 12:30:02 DEBUG - Increased reconnect to: 10s
2022-09-07_12:30:12.74612 [NATS Handler] 2022/09/07 12:30:12 DEBUG - Increased reconnect to: 10s
2022-09-07_12:30:22.74793 [NATS Handler] 2022/09/07 12:30:22 DEBUG - Increased reconnect to: 10s
2022-09-07_12:30:32.75044 [NATS Handler] 2022/09/07 12:30:32 DEBUG - Increased reconnect to: 10s
2022-09-07_12:30:42.75144 [NATS Handler] 2022/09/07 12:30:42 DEBUG - Increased reconnect to: 10s
2022-09-07_12:30:52.75313 [NATS Handler] 2022/09/07 12:30:52 DEBUG - Increased reconnect to: 10s
2022-09-07_12:31:02.75442 [NATS Handler] 2022/09/07 12:31:02 DEBUG - Increased reconnect to: 10s
2022-09-07_12:31:12.75577 [NATS Handler] 2022/09/07 12:31:12 DEBUG - Increased reconnect to: 10s
2022-09-07_12:31:22.75706 [NATS Handler] 2022/09/07 12:31:22 DEBUG - Increased reconnect to: 10s
2022-09-07_12:31:32.75829 [NATS Handler] 2022/09/07 12:31:32 DEBUG - Increased reconnect to: 10s
2022-09-07_12:31:42.76384 [NATS Handler] 2022/09/07 12:31:42 DEBUG - Increased reconnect to: 10s
2022-09-07_12:31:52.76894 [NATS Handler] 2022/09/07 12:31:52 DEBUG - Increased reconnect to: 10s
2022-09-07_12:32:02.77027 [NATS Handler] 2022/09/07 12:32:02 DEBUG - Increased reconnect to: 10s
2022-09-07_12:32:12.77609 [NATS Handler] 2022/09/07 12:32:12 DEBUG - Increased reconnect to: 10s
2022-09-07_12:32:22.77726 [NATS Handler] 2022/09/07 12:32:22 DEBUG - Increased reconnect to: 10s
2022-09-07_12:32:32.77859 [NATS Handler] 2022/09/07 12:32:32 DEBUG - Increased reconnect to: 10s
2022-09-07_12:32:42.78359 [NATS Handler] 2022/09/07 12:32:42 DEBUG - Increased reconnect to: 10s
2022-09-07_12:32:52.78978 [NATS Handler] 2022/09/07 12:32:52 DEBUG - Increased reconnect to: 10s
2022-09-07_12:33:02.79493 [NATS Handler] 2022/09/07 12:33:02 DEBUG - Increased reconnect to: 10s
2022-09-07_12:33:12.80056 [NATS Handler] 2022/09/07 12:33:12 DEBUG - Increased reconnect to: 10s
2022-09-07_12:33:22.80189 [NATS Handler] 2022/09/07 12:33:22 DEBUG - Increased reconnect to: -351909h41m29.709551616s
2022-09-07_12:33:22.80310 [NATS Handler] 2022/09/07 12:33:22 DEBUG - Increased reconnect to: -703819h22m59.419103232s
2022-09-07_12:33:22.80390 [NATS Handler] 2022/09/07 12:33:22 DEBUG - Increased reconnect to: -1407638h45m58.838206464s
2022-09-07_12:33:22.80487 [NATS Handler] 2022/09/07 12:33:22 DEBUG - Increased reconnect to: 10s
2022-09-07_12:33:32.81099 [NATS Handler] 2022/09/07 12:33:32 DEBUG - Increased reconnect to: -506459h29m21.64327424s
2022-09-07_12:33:32.81196 [NATS Handler] 2022/09/07 12:33:32 DEBUG - Increased reconnect to: -1012918h58m43.28654848s
2022-09-07_12:33:32.81315 [NATS Handler] 2022/09/07 12:33:32 DEBUG - Increased reconnect to: -2025837h57m26.57309696s
2022-09-07_12:33:32.81470 [NATS Handler] 2022/09/07 12:33:32 DEBUG - Increased reconnect to: 10s
2022-09-07_12:33:42.81988 [NATS Handler] 2022/09/07 12:33:42 DEBUG - Increased reconnect to: 10s
2022-09-07_12:33:52.82104 [NATS Handler] 2022/09/07 12:33:52 DEBUG - Increased reconnect to: -834416h55m51.456120832s
2022-09-07_12:33:52.82204 [NATS Handler] 2022/09/07 12:33:52 DEBUG - Increased reconnect to: -1668833h51m42.912241664s
2022-09-07_12:33:52.82302 [NATS Handler] 2022/09/07 12:33:52 DEBUG - Increased reconnect to: 10s
2022-09-07_12:34:02.82422 [NATS Handler] 2022/09/07 12:34:02 DEBUG - Increased reconnect to: -1551239h52m17.93941504s
2022-09-07_12:34:02.82522 [NATS Handler] 2022/09/07 12:34:02 DEBUG - Increased reconnect to: 10s
2022-09-07_12:34:12.83129 [NATS Handler] 2022/09/07 12:34:12 DEBUG - Increased reconnect to: -1080863h54m38.048108544s
2022-09-07_12:34:12.83221 [NATS Handler] 2022/09/07 12:34:12 DEBUG - Increased reconnect to: -2161727h49m16.096217088s
2022-09-07_12:34:12.83325 [NATS Handler] 2022/09/07 12:34:12 DEBUG - Increased reconnect to: 10s
2022-09-07_12:34:22.83468 [NATS Handler] 2022/09/07 12:34:22 DEBUG - Increased reconnect to: 10s
2022-09-07_12:34:32.84009 [NATS Handler] 2022/09/07 12:34:32 DEBUG - Increased reconnect to: -1921535h50m27.641081856s
2022-09-07_12:34:32.84115 [NATS Handler] 2022/09/07 12:34:32 DEBUG - Increased reconnect to: 10s
2022-09-07_12:34:42.84485 [NATS Handler] 2022/09/07 12:34:42 DEBUG - Increased reconnect to: -2562047h47m16.854775808s
2022-09-07_12:34:42.84585 [NATS Handler] 2022/09/07 12:34:42 DEBUG - Increased reconnect to: 0s
2022-09-07_12:34:42.84687 [NATS Handler] 2022/09/07 12:34:42 DEBUG - Increased reconnect to: 0s
2022-09-07_12:34:42.84791 [NATS Handler] 2022/09/07 12:34:42 DEBUG - Increased reconnect to: 0s
2022-09-07_12:34:42.84919 [NATS Handler] 2022/09/07 12:34:42 DEBUG - Increased reconnect to: 0s
2022-09-07_12:34:42.85035 [NATS Handler] 2022/09/07 12:34:42 DEBUG - Increased reconnect to: 0s

From our investigation, this regression may have been introduced with https://github.com/cloudfoundry/bosh-agent/commit/a61dd8cd0aaa1cc38c903eee214089fd46040748, where nats.CustomReconnectDelay was modified. We assume that after roughly 32 failed connection attempts, exponentialReconnectWait runs into an integer overflow. As the resulting negative values are not handled by the subsequent code, the reconnect interval is effectively set to 0.
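
For illustration, here is a minimal standalone sketch of the suspected wrap-around. It assumes natsMinRetryWait is 2 (consistent with the 2s/4s/8s steps at the top of the log); since time.Duration is an int64 nanosecond count, the multiplication by time.Second wraps into negative territory once 2^attempts seconds no longer fits:

package main

import (
  "fmt"
  "math"
  "time"
)

func main() {
  const natsMinRetryWait = 2.0 // assumed value, matching the observed 2s/4s/8s backoff steps

  for attempts := 33; attempts <= 35; attempts++ {
    // same arithmetic as in the agent snippet quoted below:
    // 2^attempts seconds, converted to an int64 nanosecond count
    wait := time.Duration(math.Pow(natsMinRetryWait, float64(attempts))) * time.Second
    fmt.Printf("attempts=%d wait=%v\n", attempts, wait)
  }
}

attempts = 34 and 35 print -351909h41m29.709551616s and -703819h22m59.419103232s, i.e. exactly the first two negative values in the log above.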

Tested with stemcells 1.92 and 1.97.

nmaurer23 commented 2 years ago

We've been able to reproduce the same issue on our end. The question is: why was MaxReconnects changed to -1?

daniel-hoefer commented 2 years ago

The problem is that exponentialReconnectWait can become negative or zero due to arithmetic overflow. That condition is not detected by the subsequent if statement; instead, the code wrongly assumes that exponentialReconnectWait is positive but smaller than natsMaxReconnectWait (i.e. 0 < exponentialReconnectWait < natsMaxReconnectWait).

exponentialReconnectWait := time.Duration(math.Pow(natsMinRetryWait, float64(attempts))) * time.Second
// A negative (overflowed) exponentialReconnectWait also satisfies this
// comparison, so the bogus value is logged and returned as the delay.
if natsMaxReconnectWait > exponentialReconnectWait {
  h.logger.Debug(natsHandlerLogTag, "Increased reconnect to: %v", exponentialReconnectWait)
  return exponentialReconnectWait
}

An easy fix would be limiting attempts to a safe value that avoids arithmetic overflow, e.g.:

exponentialReconnectWait := time.Duration(math.Pow(natsMinRetryWait, math.Min(16.0, float64(attempts)))) * time.Second

EDIT: On second thought, the goal of exponential backoff with a lower risk of arithmetic overflow would be better served by moving natsMinRetryWait outside of math.Pow():

exponentialReconnectWait := time.Duration(natsMinRetryWait * math.Pow(2.0, math.Min(16.0, float64(attempts - 1)))) * time.Second

(attempts starts at 1, so on the first attempt attempts - 1 yields math.Pow(2.0, 0.0) == 1.0)
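
For illustration, here is a rough sketch of the delays the revised formula yields, with assumed values natsMinRetryWait = 2 and natsMaxReconnectWait = 10s; the 10s ceiling is presumably applied by the surrounding handler (as the repeated "Increased reconnect to: 10s" log lines suggest), so it is inlined here just to show the resulting sequence:

package main

import (
  "fmt"
  "math"
  "time"
)

func main() {
  const natsMinRetryWait = 2.0                  // assumed, in seconds
  const natsMaxReconnectWait = 10 * time.Second // assumed cap, matching the log

  for attempts := 1; attempts <= 36; attempts++ {
    exp := math.Min(16.0, float64(attempts-1)) // clamping the exponent prevents the overflow
    wait := time.Duration(natsMinRetryWait*math.Pow(2.0, exp)) * time.Second
    if wait > natsMaxReconnectWait {
      wait = natsMaxReconnectWait // hypothetical cap, mirroring the 10s ceiling in the log
    }
    fmt.Printf("attempts=%d wait=%v\n", attempts, wait) // 2s, 4s, 8s, then 10s from attempt 4 on
  }
}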

daniel-hoefer commented 2 years ago

@rkoster The EasyCLA step is now covered in the pull request above.

ramonskie commented 2 years ago

@daniel-hoefer the above has been merged and added to the latest stemcells

schindlersebastian commented 2 years ago

@ramonskie as far as we can see in the release notes, the stemcells bionic-1.107 and jammy-1.18 contain the bosh agent in version 2.468.0 (which does not include the merge). Only 2.469.0 contains the fixed code, but that version has not yet made it into a new stemcell...

ramonskie commented 2 years ago

Whoops, my mistake. I will try to release a new one at the end of the week.

schindlersebastian commented 2 years ago

Hi @ramonskie,

Whoops, my mistake. I will try to release a new one at the end of the week.

Is there any ETA for a new stemcell? Sorry for asking, but we are currently unable to update / recreate / modify our bosh directors, because the current bosh agent still floods our infrastructure whenever the director is unavailable.

Thanks in advance! Sebastian

ramonskie commented 2 years ago

A new stemcell has been released which includes the new agent: https://github.com/cloudfoundry/bosh-linux-stemcell-builder/releases/tag/ubuntu-bionic%2Fv1.115

schindlersebastian commented 2 years ago

@ramonskie the problem is solved. The NATS reconnect timeout is increased up to 10s and remains at that value until NATS is available again. You can close this issue; thanks for the support.

Regards, Sebastian

daniel-hoefer commented 2 years ago

For the sake of completeness: here is a graph similar to the one from @guzzisti in the initial post, showing the number of connection attempts per second from a single client to the director.

[graph: connection attempts per second from a single client after the fix]

Instead of hundreds of connection attempts per client per second, the behaviour is now back to normal, i.e. one connection attempt every 10s (and initially somewhat more frequently).