We've been able to reproduce the same issue on our end. The question is why `MaxReconnects` was changed to -1.
The problem is that `exponentialReconnectWait` can be negative or zero due to arithmetic overflow. That condition is not detected by the subsequent if-statement. Instead, the code wrongly assumes that `exponentialReconnectWait` has a positive value, but smaller than `natsMaxReconnectWait` (i.e. 0 < `exponentialReconnectWait` < `natsMaxReconnectWait`).
```go
exponentialReconnectWait := time.Duration(math.Pow(natsMinRetryWait, float64(attempts))) * time.Second
if natsMaxReconnectWait > exponentialReconnectWait {
	h.logger.Debug(natsHandlerLogTag, "Increased reconnect to: %v", exponentialReconnectWait)
	return exponentialReconnectWait
}
```
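To illustrate the overflow, here is a minimal standalone sketch of the same expression. The value of `natsMinRetryWait` (2) and the attempt counts are assumptions for illustration only, so the exact attempt at which the wrap happens may differ from the real agent:

```go
// Standalone sketch of the overflow; natsMinRetryWait = 2 is an assumed value.
package main

import (
	"fmt"
	"math"
	"time"
)

const natsMinRetryWait = 2 // assumed value, for illustration only

func main() {
	for _, attempts := range []int{8, 16, 34, 64} {
		// Same shape as the agent expression: float64 -> time.Duration -> * time.Second.
		// Once the product no longer fits into int64 nanoseconds the multiplication
		// wraps around, and if math.Pow already exceeds the int64 range the float->int
		// conversion itself is implementation-defined; either way the result can end
		// up zero or negative for large attempt counts.
		wait := time.Duration(math.Pow(natsMinRetryWait, float64(attempts))) * time.Second
		fmt.Printf("attempts=%2d -> %v (non-positive: %t)\n", attempts, wait, wait <= 0)
	}
}
```

Depending on platform and on the real constant, the computed wait wraps to a negative value and eventually to 0, which would explain the zero-delay reconnect flood described above.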
An easy fix would be limiting `attempts` to a safe value that avoids arithmetic overflow, e.g.:

```go
exponentialReconnectWait := time.Duration(math.Pow(natsMinRetryWait, math.Min(16.0, float64(attempts)))) * time.Second
```
EDIT: On second thought, the goal of exponential backoff with a lower risk of arithmetic overflow would be better served by moving `natsMinRetryWait` outside of `math.Pow()`:

```go
exponentialReconnectWait := time.Duration(natsMinRetryWait * math.Pow(2.0, math.Min(16.0, float64(attempts - 1)))) * time.Second
```

(`attempts` starts with a value of 1, thus `attempts - 1` -> `math.Pow(2.0, 0.0) == 1.0`)
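Putting the pieces together, a complete helper with the capped exponent plus an explicit guard against non-positive values could look roughly like the sketch below. The constant values, the function name and the clamping to the maximum are assumptions, not the actual bosh-agent code:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

const (
	natsMinRetryWait     = 2                // seconds; assumed value
	natsMaxReconnectWait = 10 * time.Second // assumed value
)

// exponentialReconnect doubles the minimum wait per attempt, caps the exponent so
// the float arithmetic cannot overflow int64 nanoseconds, and clamps the result
// to the configured maximum (hypothetical helper for illustration).
func exponentialReconnect(attempts int) time.Duration {
	// attempts starts at 1, so attempts-1 yields 2^0 == 1 on the first retry.
	exponent := math.Min(16.0, float64(attempts-1))
	wait := time.Duration(natsMinRetryWait*math.Pow(2.0, exponent)) * time.Second
	// Belt and braces: never hand a zero or negative delay back to the nats client.
	if wait <= 0 || wait > natsMaxReconnectWait {
		return natsMaxReconnectWait
	}
	return wait
}

func main() {
	for attempts := 1; attempts <= 6; attempts++ {
		fmt.Printf("attempt %d -> wait %v\n", attempts, exponentialReconnect(attempts))
	}
}
```

With these assumed constants the wait grows 2s, 4s, 8s and then stays at 10s, which matches the post-fix behaviour reported later in this thread.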
@rkoster The EasyCLA step is now covered in the pull request above.
@daniel-hoefer the above has been merged and added to the latest stemcells
@ramonskie as far as we can see in the release notes, the stemcells bionic-1.107 and jammy-1.18 contain the bosh agent in version 2.468.0 (which does not include the merge). Only 2.469.0 contains the fixed code, but this version has not yet made it into a new stemcell...
woops my mistake i will try to release a new one at the end of the week
Hi @ramonskie,
> woops my mistake i will try to release a new one at the end of the week
is there any ETA for a new stemcell? Sorry for asking, but we are not able to update / recreate / modify our bosh directors at the moment, because the current bosh agent still floods our infrastructure when the director is not available.
Thanks in advance! Sebastian
A new stemcell has been released which includes the new agent: https://github.com/cloudfoundry/bosh-linux-stemcell-builder/releases/tag/ubuntu-bionic%2Fv1.115
@ramonskie the problem is solved. The nats reconnect timeout is increased up to 10s and remains at this value until nats is available again. You can close this issue, thanks for the support
Regards, Sebastian
For the sake of completeness: here is a graph similar to the one from @guzzisti in the initial post, showing the number of connection attempts per second from a single client to the director.
Instead of hundreds of connection attempts per client per second, the behaviour is now back to normal, i.e. 1 connection attempt every 10s (and somewhat more frequently at first).
Observed behaviour
When a director is down for maintenance, every VM with a bosh agent starts to flood the director with connection requests to port 4222 after a while (~5 minutes in our environment). We see up to several hundred connection attempts per second (sic!) from every VM managed by bosh. After several minutes of packet storm it calms down, only to start again after a short time of normal operation. The graph shows the number of connection attempts by a single VM per second:
This causes severe load on the network components, depending on how large your bosh environment is.
The bosh agent log shows the section:
From our investigation this regression may have been introduced with https://github.com/cloudfoundry/bosh-agent/commit/a61dd8cd0aaa1cc38c903eee214089fd46040748 when nats.CustomReconnectDelay was modified. We assume that after 32 failed connection attempts, `exponentialReconnectWait` runs into an integer overflow. As negative values are not handled in the code afterwards, the reconnect interval is set to 0.

Tested with stemcells 1.92 and 1.97.
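For readers unfamiliar with how this callback reaches the client, the sketch below shows in simplified form how a custom reconnect-delay handler is typically wired into nats.go. The URL, the option values and the callback body are placeholders, not the agent's real configuration; the point is that whatever the callback returns becomes the pause between retries, so a callback that overflows into 0 (or a negative value) turns the backoff into the connection flood shown in the graph above.

```go
package main

import (
	"log"
	"math"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://director.example.internal:4222", // placeholder URL
		nats.MaxReconnects(-1), // retry forever
		nats.CustomReconnectDelay(func(attempts int) time.Duration {
			// Capped exponential backoff; a buggy handler returning <= 0 here
			// would make every retry immediate.
			wait := time.Duration(math.Pow(2.0, math.Min(16.0, float64(attempts)))) * time.Second
			if wait <= 0 || wait > 10*time.Second {
				wait = 10 * time.Second
			}
			return wait
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()
	// ... application logic ...
}
```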