BOSH DNS healthcheck server spammed with requests from other VMs

videlov commented 5 years ago

Symptoms: On our big landscapes (1000 instances), when rotating dns_healthcheck certificates (step 1 of the guide https://bosh.io/docs/dns/#rotating-dns-certificates) the api instances of cf go up and stay at 100% cpu usage.

The cpu is used by the bosh-dns-healthcheck process.

Output from bosh_dns_health.stderr.log on the high cpu vm:

bosh_dns_health.stderr.log:2019/08/30 12:05:29 http: TLS handshake error from 10.0.65.98:53618: EOF
bosh_dns_health.stderr.log:2019/08/30 12:05:29 http: TLS handshake error from 10.0.75.167:52964: EOF
bosh_dns_health.stderr.log:2019/08/30 12:05:29 http: TLS handshake error from 10.0.139.134:50508: EOF
bosh_dns_health.stderr.log:2019/08/30 12:05:29 http: TLS handshake error from 10.0.75.99:34978: EOF
bosh_dns_health.stderr.log:2019/08/30 12:05:29 http: TLS handshake error from 10.0.74.87:34814: EOF
bosh_dns_health.stderr.log:2019/08/30 12:05:29 http: TLS handshake error from 10.0.65.72:60698: EOF
bosh_dns_health.stderr.log:2019/08/30 12:05:29 http: TLS handshake error from 10.0.74.194:45532: EOF
bosh_dns_health.stderr.log:2019/08/30 12:05:29 http: TLS handshake error from 10.0.65.98:58472: EOF
bosh_dns_health.stderr.log:2019/08/30 12:05:29 http: TLS handshake error from 10.0.138.231:52474: EOF
bosh_dns_health.stderr.log:2019/08/30 12:05:29 http: TLS handshake error from 10.0.129.112:47924: EOF
bosh_dns_health.stderr.log:2019/08/30 12:05:29 http: TLS handshake error from 10.0.75.120:35912: EOF
bosh_dns_health.stderr.log:2019/08/30 12:05:29 http: Accept error: accept tcp 0.0.0.0:8853: accept4: too many open files; retrying in 5ms
bosh_dns_health.stderr.log:2019/08/30 12:05:29 http: TLS handshake error from 10.0.65.115:53082: EOF
bosh_dns_health.stderr.log:2019/08/30 12:05:29 http: TLS handshake error from 10.0.75.103:33612: EOF
bosh_dns_health.stderr.log:2019/08/30 12:05:29 http: TLS handshake error from 10.0.129.172:45088: EOF
bosh_dns_health.stderr.log:2019/08/30 12:05:29 http: TLS handshake error from 10.0.138.153:47992: EOF
bosh_dns_health.stderr.log:2019/08/30 12:05:29 http: TLS handshake error from 10.0.74.38:47952: EOF
bosh_dns_health.stderr.log:2019/08/30 12:05:29 http: TLS handshake error from 10.0.74.60:56856: EOF
bosh_dns_health.stderr.log:2019/08/30 12:05:29 http: TLS handshake error from 10.0.129.113:53130: EOF
bosh_dns_health.stderr.log:2019/08/30 12:05:29 http: TLS handshake error from 10.0.139.7:35220: EOF
bosh_dns_health.stderr.log:2019/08/30 12:05:29 http: TLS handshake error from 10.0.65.98:37174: EOF

Analysis: It seems that other vms try to contact the api instances to check their health. For some reason that fails with a TLS Handshake error. Other vms don't wait but retry immediately. Client 10.0.65.98 for example tries to connect 3 times in the same second. This all is strange as step 1 only concatenates CAs we would expect TLS connections to succeed.

Workaround: The vms stay on the high cpu usage for several hours. After half a day the usage slowly goes down by itself. Alternatively, restarting the bosh-dns-healthcheck process solves it.

Question: Are you aware of such an issue, and how could we solve / prevent this from happening in the future?

Coding: Should we log errors when the health check requests fail? here

cf-gitbot commented 5 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/168248936

The labels on this github issue will be updated when the story is started.

rkoster commented 5 years ago

We observe similar behaviour in a single az cluster (small cf deployment):

2019/09/05 04:59:31 http: TLS handshake error from 10.1.2.49:35572: read tcp 10.1.2.25:8853->10.1.2.49:35572: read: connection reset by peer
2019/09/05 05:00:44 http: TLS handshake error from 10.1.2.49:38142: EOF
2019/09/05 05:03:09 http: TLS handshake error from 10.1.3.0:59910: EOF
2019/09/05 05:03:13 http: TLS handshake error from 10.1.3.0:35044: EOF
2019/09/05 05:03:13 http: TLS handshake error from 10.1.3.0:33886: EOF
2019/09/05 05:03:25 http: TLS handshake error from 10.1.3.0:60864: EOF
2019/09/05 05:03:50 http: TLS handshake error from 10.1.3.0:45692: EOF
2019/09/05 05:03:50 http: TLS handshake error from 10.1.3.0:44330: EOF
2019/09/05 05:03:50 http: TLS handshake error from 10.1.3.0:41680: EOF
2019/09/05 05:03:50 http: TLS handshake error from 10.1.3.0:42994: EOF
2019/09/05 05:06:33 http: TLS handshake error from 10.1.3.0:57556: EOF
2019/09/05 05:06:33 http: TLS handshake error from 10.1.3.0:58936: EOF
2019/09/05 05:06:44 http: TLS handshake error from 10.1.3.0:33870: EOF
2019/09/05 05:06:44 http: TLS handshake error from 10.1.3.0:60450: EOF
2019/09/05 05:07:01 http: TLS handshake error from 10.1.3.0:42704: EOF
2019/09/05 05:09:00 http: TLS handshake error from 10.1.3.0:41248: EOF
2019/09/05 05:09:00 http: TLS handshake error from 10.1.3.0:43260: EOF
2019/09/05 05:09:18 http: TLS handshake error from 10.1.3.0:45094: EOF
2019/09/05 05:09:18 http: TLS handshake error from 10.1.3.0:47282: EOF
2019/09/05 05:09:26 http: TLS handshake error from 10.1.3.0:56672: EOF
2019/09/05 05:09:26 http: TLS handshake error from 10.1.3.0:34906: EOF
2019/09/05 05:09:27 http: TLS handshake error from 10.1.3.0:58810: EOF
2019/09/05 05:09:36 http: TLS handshake error from 10.1.3.0:2974: EOF
2019/09/05 05:10:26 http: TLS handshake error from 10.1.3.0:47334: EOF

We also notice the bosh dns process is not responding to dns requests anymore:

 dig +tcp +time=10 +tries=1 -p 53 upcheck.bosh-dns. @169.254.0.2
;; Connection to 169.254.0.2#53(169.254.0.2) for upcheck.bosh-dns. failed: connection refused.

Monit however thinks everything is fine:

monit summary
The Monit daemon 5.2.5 uptime: 19h 34m

Process 'garden'                    running
Process 'rep'                       running
Process 'route_emitter'             running
Process 'netmon'                    running
Process 'vxlan-policy-agent'        Does not exist
Process 'silk-daemon'               running
Process 'loggr-udp-forwarder'       running
Process 'loggregator_agent'         running
Process 'loggr-forwarder-agent'     running
Process 'prom_scraper'              running
Process 'bosh-dns'                  running
Process 'bosh-dns-resolvconf'       running
Process 'bosh-dns-healthcheck'      running
System 'system_localhost'           running

At the time of our investigation the health checks already stopped (last one was 2 hours ago) but still bosh dns was not responding. Restarting bosh-dns with monit "fixed" the issue.

rkoster commented 5 years ago

on other nodes I found bosh-dns being up but monit gave a status of execution failed. Monit logs did not say way it was in this state:

[UTC Sep  5 05:14:02] error    : 'bosh-dns' process with pid 3057 is a zombie
[UTC Sep  5 05:15:05] error    : Sendmail: error receiving data from the mailserver 'localhost' -- Resource temporarily unavailable
[UTC Sep  5 05:15:35] error    : Sendmail: error receiving data from the mailserver 'localhost' -- Resource temporarily unavailable
[UTC Sep  5 05:15:35] error    : Alert handler failed, retry scheduled for next cycle
[UTC Sep  5 05:15:50] error    : 'bosh-dns' process is not running
[UTC Sep  5 05:16:10] error    : Sendmail: error receiving data from the mailserver 'localhost' -- Resource temporarily unavailable
[UTC Sep  5 05:16:26] info     : 'bosh-dns' trying to restart
[UTC Sep  5 05:16:26] info     : 'bosh-dns' start: /var/vcap/jobs/bosh-dns/bin/bosh_dns_ctl
[UTC Sep  5 05:18:42] error    : 'bosh-dns' failed to start
[UTC Sep  5 05:19:14] error    : Sendmail: error receiving data from the mailserver 'localhost' -- Resource temporarily unavailable
[UTC Sep  5 05:19:24] info     : 'bosh-dns' process is running with pid 3560
[UTC Sep  5 05:19:24] info     : 'bosh-dns' check process state succeeded

But does show it detected a zombie 'bosh-dns' process with pid 3057 is a zombie.

beyhan commented 5 years ago

Further looking into this we suspect, that the root cause could be in the configured timeouts and intervals for heathcheck and retry:

the http/client timeouts are configured here
the healthcheck intervals is configured here
the retry is configured here

On a VM with this issue, where the CPU load is always on 100% we observed following: There are too many healthcheck connections in close wait

netstat -ant | awk '{print $6}' | sort | uniq -c | sort -n
      1 established)
      1 Foreign
      2 SYN_SENT
     26 LISTEN
     35 TIME_WAIT
     55 SYN_RECV
    747 ESTABLISHED
   3699 CLOSE_WAIT

On a different VM, which tries to get the health status of the VM under load we observe too many parallel healthcheck requests:

Every 0.1s: ss -t -a | grep 10.0.1.2                                                                                                                                                                                                                                                           

SYN-SENT   0      1      10.0.1.140730                10.0.1.2:8853
SYN-SENT   0      1      10.0.1.140434                10.0.1.2:8853
SYN-SENT   0      1      10.0.1.140522                10.0.1.2:8853
SYN-SENT   0      1      10.0.1.140392                10.0.1.2:8853
SYN-SENT   0      1      10.0.1.140436                10.0.1.2:8853
SYN-SENT   0      1      10.0.1.140724                10.0.1.2:8853
SYN-SENT   0      1      10.0.1.140004                10.0.1.2:8853
SYN-SENT   0      1      10.0.1.140718                10.0.1.2:8853
SYN-SENT   0      1      10.0.1.140092                10.0.1.2:8853
SYN-SENT   0      1      10.0.1.140716                10.0.1.2:8853
SYN-SENT   0      1      10.0.1.140246                10.0.1.2:8853
SYN-SENT   0      1      10.0.1.140720                10.0.1.2:8853
SYN-SENT   0      1      10.0.1.140394                10.0.1.2:8853
SYN-SENT   0      1      10.0.1.140792                10.0.1.2:8853
SYN-SENT   0      1      10.0.1.140396                10.0.1.2:8853

Our theory is that the VM with the issue is under high load and can't handle the connections. So we observed manly connections in close wait on that VM:

Every 0.1s: ss -t -a | grep 10.0.1.1                                                                                                                                                                                                                                                             

CLOSE-WAIT 1      0      10.0.1.2:8853                 10.0.1.1:57316
CLOSE-WAIT 1      0      10.0.1.2:8853                 10.0.1.1:56546
CLOSE-WAIT 1      0      10.0.1.2:8853                 10.0.1.1:59310
CLOSE-WAIT 1      0      10.0.1.2:8853                 10.0.1.1:58100
CLOSE-WAIT 1      0      10.0.1.2:8853                 10.0.1.1:57308
CLOSE-WAIT 1      0      10.0.1.2:8853                 10.0.1.1:57298
CLOSE-WAIT 1      0      10.0.1.2:8853                 10.0.1.1:57310
CLOSE-WAIT 1      0      10.0.1.2:8853                 10.0.1.1:57314
CLOSE-WAIT 1      0      10.0.1.2:8853                 10.0.1.1:57318
CLOSE-WAIT 1      0      10.0.1.2:8853                 10.0.1.1:53518
CLOSE-WAIT 1      0      10.0.1.2:8853                 10.0.1.1:33152

The open file handle limits for the healthcheck process, where 4096. Here we observe the too many open files error, which is a limit how many requests the server can process. We actually increased it to 20000, but this didn't help to resolve the situation. All file handles were taken after short period of time.

cf-gitbot commented 5 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/168934149

The labels on this github issue will be updated when the story is started.

christarazi commented 5 years ago

Hey @beyhan @rkoster @videlov ,

We committed a potential fix. Can you try to see if that alleviates the symptoms you see?

Thanks, @christarazi @jaresty

videlov commented 5 years ago

With the fix we see that the file handles usage drops. We are not sure wether it will solve our issue completely. Further improvements could be:

Do you think logging /health request errors make sense here - https://github.com/cloudfoundry/bosh-dns-release/blob/7ccbf00958d6ca50060495722d7ae29d78ed41d6/src/bosh-dns/dns/server/healthiness/health_checker.go#L34
What do you think making check interval and retry configurable in the release: https://github.com/cloudfoundry/bosh-dns-release/blob/2ea64708b57190a8fead8d5e877bf9a6c3423cb1/jobs/bosh-dns/templates/config.json.erb#L25 https://github.com/cloudfoundry/bosh-dns-release/blob/7ccbf00958d6ca50060495722d7ae29d78ed41d6/src/bosh-dns/tlsclient/client.go#L71-L74
Why is the http connection timeout reduced to 5 seconds for the healthchecker: https://github.com/cloudfoundry/bosh-dns-release/blob/c2eb9e317e445adc65b8159b14c16fbbd9a4f73a/src/bosh-dns/tlsclient/client.go#L38 Would this eventually help on the EOF problems?

cc @beyhan

beyhan commented 5 years ago

Configurable ulimts could also make sense https://github.com/cloudfoundry/bosh-dns-release/blob/c2eb9e317e445adc65b8159b14c16fbbd9a4f73a/jobs/bosh-dns/templates/bosh_dns_health_ctl.erb#L44

videlov commented 5 years ago

Hi @christarazi @jaresty , we are going to test your changes and report back to you.

When do you plan to cut a new release for this repo? This removes the burden of maintaining a fork for us.

Thanks, @BeckerMax @videlov

jaresty commented 5 years ago

If the fix seems good to you then we can cut a release. Please let us know how it goes. Thanks!

beyhan commented 5 years ago

Hi @jaresty,

We validated the provided fix and it doesn't fix the issue. We don't see any change in the behavior of the health endpoint. Our understanding why we get into this situation is as following:

a VM's health endpoint can't respond to the requests for some time (e.g. because of an update)
every DNS query for healthy instances on a VM will generate a new health check request towards the unresponsive VM
when the health endpoint is responsive again, the queue of requests is so long, that it can't handle it, because the CPU overhead for a mTLS handshake is quite high.

A solution could be to avoid multiple parallel check requests against the same target health server from the same client VM. The idea is to have a single active request to a health endpoint not multiple in parallel, because the requests are quite expensive with mTLS. Does this make sense?

cc @videlov

jaresty commented 5 years ago

Hi @beyhan -

Thanks for checking this out for us. Your suggestion makes sense based on the behavior you described above. I am a bit surprised that it is possible for the requests to stack up in that way given that the timeout (5s) is less than the check interval (20s). I would have expected these health check requests to timeout instead of stacking up as described. Can you share a bit more about your investigation so we can understand how to reproduce it locally?

Thanks, Josh

videlov commented 5 years ago

Steps to reproduce the issue on a machine with up to 4 CPUs and 6 GB RAM:

Deploy a bosh-lite with the steps from https://bosh.io/docs/bosh-lite/
Deploy zookeeper as described in https://bosh.io/docs/bosh-lite/#deploy with 10 instances instead of 5
Select a zookeeper instance and ssh into it.
Execute iptables -A INPUT -p tcp --destination-port 8853 -j DROP to simulate an update.
Wait until the health state for the instance becomes unknown. You may check from another zookeeper instance with: /var/vcap/jobs/bosh-dns/bin/cli instances
Execute dns queries from all other instances with bosh is -d zookeeper | grep -v <instance-id> | grep -v smoke-tests | awk '{print $1}' | xargs -I {} -P 9 bosh -d zookeeper ssh {} -c "for i in {1..100}; do printf 'dig q-s3-i<instance-bosh-index>.zookeeper.*.*.bosh; echo %s;\\n' {0..200} | xargs -P 200 -I % sh -c '%'; done"
Wait 10sec and drop the iptable rule iptables -D INPUT -p tcp --destination-port 8853 -j DROP
Check that the instance gets too many request and reaches file handles limit: tail -f /var/vcap/sys/log/bosh-dns/bosh_dns_health.std* will show many error, e.g.:

2019/10/28 12:32:58 http: Accept error: accept tcp 0.0.0.0:8853: accept4: too many open files; retrying in 5ms 2019/10/28 12:33:19 http: TLS handshake error from 10.244.0.4:47990: EOF

lsof -p <dns-health-process-pid> | wc -l netstat -ant | awk '{print $6}' | sort | uniq -c | sort -n

The instance health status should stay in unknown state and the dig dns queries should not be able to resolve this instance.

cc @beyhan

jaresty commented 5 years ago

Thanks @videlov for the detailed reproduction steps. Using these steps I was indeed able to reproduce the error described. I've submitted a fix which seems to resolve the issue based on the reproduction steps you shared. Can you try it out and let us know how it goes?

jli1972 commented 4 years ago

Hi Jaresty, To which version / tile is the fixed applied ?

Thanks.

Kelvin

beyhan commented 4 years ago

I can confirm that the provided fix reduces the request load and works fine. Clients do now one request towards unhealthy instances or four in parallel, when a retry happens.

We observe now higher CPU load with the dns version v1.15.0 plus the fix for request batching compared to the old version 1.12.0, which we head (Unfortunately, we updated also the dns version, which makes isolation of the performance regression more complicated, that is why I'm adding the comment here, but we can open a different issue for it). By higher CPU load I mean following:

dns-health process in version v1.15.0 plus the request batching fix uses 30% to 40% CPU for 300 to 400 incoming req/s
dns-health process in version v1.12.0 uses 10% to 20% CPU for 300 to 400 incoming req/s

Can you thing of any change, which can cause this? Do you also see any change in the CPU usage with v1.15.0 compared to older versions?

beyhan commented 4 years ago

I have opened issue https://github.com/cloudfoundry/bosh-dns-release/issues/52 for the performance regression in the health endpoint. I verified, that the performance issue is not related to the fix provided here for request batching and was introduced with version v1.15.0.

friegger commented 4 years ago

@beyhan From my understanding the fix provided above fixes this issue by batching requests to the same health endpoint. Do we still see further improvement options like different retry patterns or using connections with keep_alive to avoid expensive certificate evaluation? @jaresty What do you think about the latter?

beyhan commented 4 years ago

Reusing connections will reduce the CPU load a lot, because as you can see in this issue the health-server spends a lot of time with handshakes.

I think that we don't need retries, because the check interval is quite low 20s. Avoiding retries will only help in situation, where the health server is overloaded. It doesn't help in a normal situation, where the server can handle the load.

I don't see potential for other improvements at the moment.

cjnosal commented 4 years ago

I've made several changes that were touched on in this thread: https://github.com/cloudfoundry/bosh-dns-release/compare/383808ccc7f5e5fe72fc44fe936ed14961d6a15e...de827faf9736a5ecd43810636234e06025946359

health.request_timeout, health.local_health_interval, health.remote_health_interval, health.synchronous_check_timeout are now configurable in job spec. (https://www.pivotaltracker.com/story/show/170479480)
DNS lookups will no longer trigger a health request if cached health information is already available (it will be refreshed on the next remote_health_interval) (https://www.pivotaltracker.com/story/show/170506259)
remove network retries on each remote_health_interval health requests (https://www.pivotaltracker.com/story/show/170158998)
log health responses at DEBUG level (https://www.pivotaltracker.com/story/show/170477655)

In my small scale testing (12 n1-standard-4 VMs spamming each other with 250 instance group lookups) overall CPU usage dropped from 150% to 30% compared to 1.16.

I didn't look at file handle/ulimit changes, but now that health checks scale with number of domains instead of number of requests that bottleneck is hopefully relieved.

Build candidate https://s3.amazonaws.com/bosh-dns-release-candidate-tarballs/bosh-dns-dev-release.tgz?versionId=ihp_vYdf999PgoR2xiSAjyNvQVLqrD40

beyhan commented 4 years ago

Thanks! We will give it a try!

cjnosal commented 4 years ago

Delivered in v1.17.0

cloudfoundry / bosh-dns-release

BOSH DNS healthcheck server spammed with requests from other VMs #50