[Feature request] Distribute health checks over the remote health interval

beyhan commented 4 years ago

We observe that the bosh-dns server runs all currently active health checks at once. As a result, the bosh-dns process produces CPU spikes, which can affect other processes on the same VM. It would be better to distribute the checks over the remote_health_interval.

cf-gitbot commented 4 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/172245015

The labels on this github issue will be updated when the story is started.

cjnosal commented 4 years ago

One option that might help is to set the HealthFilter/HealthWatcher work pool size in the job spec. When adjusting the value, it is important that the work pool can get through all requests within one remote_health_interval; otherwise a backlog will accumulate.

https://github.com/cloudfoundry/bosh-dns-release/blob/master/src/bosh-dns/dns/main.go#L145

https://github.com/cloudfoundry/bosh-dns-release/blob/master/src/bosh-dns/dns/server/records/health_filter.go#L40

Triggering health checks at different times (instead of triggering everything on one timer) could work but will likely involve more complexity.

I don't have access to large environments to verify the change in CPU load that these changes would hopefully offer, so I'd appreciate it if you could prove that out.

bosh-admin-bot commented 3 years ago

This issue was marked as Stale because it has been open for 21 days without any activity. If no activity takes place in the coming 7 days, it will automatically be closed. To prevent this from happening, remove the Stale label or comment below.

bosh-admin-bot commented 3 years ago

This issue was closed because it has been labeled Stale for 7 days without subsequent activity. Feel free to re-open this issue at any time by commenting below.

cloudfoundry / bosh-dns-release

[Feature request] Distribute health checks over the remote health interval #58