We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/168248936
The labels on this github issue will be updated when the story is started.
We observe similar behaviour in a single-AZ cluster (small CF deployment):
2019/09/05 04:59:31 http: TLS handshake error from 10.1.2.49:35572: read tcp 10.1.2.25:8853->10.1.2.49:35572: read: connection reset by peer
2019/09/05 05:00:44 http: TLS handshake error from 10.1.2.49:38142: EOF
2019/09/05 05:03:09 http: TLS handshake error from 10.1.3.0:59910: EOF
2019/09/05 05:03:13 http: TLS handshake error from 10.1.3.0:35044: EOF
2019/09/05 05:03:13 http: TLS handshake error from 10.1.3.0:33886: EOF
2019/09/05 05:03:25 http: TLS handshake error from 10.1.3.0:60864: EOF
2019/09/05 05:03:50 http: TLS handshake error from 10.1.3.0:45692: EOF
2019/09/05 05:03:50 http: TLS handshake error from 10.1.3.0:44330: EOF
2019/09/05 05:03:50 http: TLS handshake error from 10.1.3.0:41680: EOF
2019/09/05 05:03:50 http: TLS handshake error from 10.1.3.0:42994: EOF
2019/09/05 05:06:33 http: TLS handshake error from 10.1.3.0:57556: EOF
2019/09/05 05:06:33 http: TLS handshake error from 10.1.3.0:58936: EOF
2019/09/05 05:06:44 http: TLS handshake error from 10.1.3.0:33870: EOF
2019/09/05 05:06:44 http: TLS handshake error from 10.1.3.0:60450: EOF
2019/09/05 05:07:01 http: TLS handshake error from 10.1.3.0:42704: EOF
2019/09/05 05:09:00 http: TLS handshake error from 10.1.3.0:41248: EOF
2019/09/05 05:09:00 http: TLS handshake error from 10.1.3.0:43260: EOF
2019/09/05 05:09:18 http: TLS handshake error from 10.1.3.0:45094: EOF
2019/09/05 05:09:18 http: TLS handshake error from 10.1.3.0:47282: EOF
2019/09/05 05:09:26 http: TLS handshake error from 10.1.3.0:56672: EOF
2019/09/05 05:09:26 http: TLS handshake error from 10.1.3.0:34906: EOF
2019/09/05 05:09:27 http: TLS handshake error from 10.1.3.0:58810: EOF
2019/09/05 05:09:36 http: TLS handshake error from 10.1.3.0:2974: EOF
2019/09/05 05:10:26 http: TLS handshake error from 10.1.3.0:47334: EOF
We also notice that the bosh-dns process is no longer responding to DNS requests:
dig +tcp +time=10 +tries=1 -p 53 upcheck.bosh-dns. @169.254.0.2
;; Connection to 169.254.0.2#53(169.254.0.2) for upcheck.bosh-dns. failed: connection refused.
Monit, however, thinks everything is fine:
monit summary
The Monit daemon 5.2.5 uptime: 19h 34m
Process 'garden' running
Process 'rep' running
Process 'route_emitter' running
Process 'netmon' running
Process 'vxlan-policy-agent' Does not exist
Process 'silk-daemon' running
Process 'loggr-udp-forwarder' running
Process 'loggregator_agent' running
Process 'loggr-forwarder-agent' running
Process 'prom_scraper' running
Process 'bosh-dns' running
Process 'bosh-dns-resolvconf' running
Process 'bosh-dns-healthcheck' running
System 'system_localhost' running
At the time of our investigation the health checks had already stopped (the last one was two hours earlier), but bosh-dns was still not responding. Restarting bosh-dns with monit "fixed" the issue.
On other nodes I found bosh-dns up, but monit reported a status of "execution failed". The monit logs did not say why it was in this state:
[UTC Sep 5 05:14:02] error : 'bosh-dns' process with pid 3057 is a zombie
[UTC Sep 5 05:15:05] error : Sendmail: error receiving data from the mailserver 'localhost' -- Resource temporarily unavailable
[UTC Sep 5 05:15:35] error : Sendmail: error receiving data from the mailserver 'localhost' -- Resource temporarily unavailable
[UTC Sep 5 05:15:35] error : Alert handler failed, retry scheduled for next cycle
[UTC Sep 5 05:15:50] error : 'bosh-dns' process is not running
[UTC Sep 5 05:16:10] error : Sendmail: error receiving data from the mailserver 'localhost' -- Resource temporarily unavailable
[UTC Sep 5 05:16:26] info : 'bosh-dns' trying to restart
[UTC Sep 5 05:16:26] info : 'bosh-dns' start: /var/vcap/jobs/bosh-dns/bin/bosh_dns_ctl
[UTC Sep 5 05:18:42] error : 'bosh-dns' failed to start
[UTC Sep 5 05:19:14] error : Sendmail: error receiving data from the mailserver 'localhost' -- Resource temporarily unavailable
[UTC Sep 5 05:19:24] info : 'bosh-dns' process is running with pid 3560
[UTC Sep 5 05:19:24] info : 'bosh-dns' check process state succeeded
They do, however, show that monit detected a zombie: 'bosh-dns' process with pid 3057 is a zombie.
Looking further into this, we suspect that the root cause could be in the configured timeouts and intervals for health check and retry: the http/client timeouts are configured here. On a VM with this issue, where the CPU load is constantly at 100%, we observed the following: there are too many health check connections in CLOSE_WAIT (a hedged Go sketch of such a client configuration follows the netstat output below).
netstat -ant | awk '{print $6}' | sort | uniq -c | sort -n
1 established)
1 Foreign
2 SYN_SENT
26 LISTEN
35 TIME_WAIT
55 SYN_RECV
747 ESTABLISHED
3699 CLOSE_WAIT
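For illustration only, here is a minimal Go sketch of how such client-side timeouts are typically wired into a mutual-TLS health-check client; the file paths, endpoint address, interval, and timeout values are assumptions for the sketch and not bosh-dns's actual code or configuration:

// Hedged sketch of a mutual-TLS health-check client with explicit timeouts.
// File paths, addresses, and values are assumptions, not bosh-dns configuration.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func newHealthClient(caPath, certPath, keyPath string) (*http.Client, error) {
	caPEM, err := os.ReadFile(caPath)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	cert, err := tls.LoadX509KeyPair(certPath, keyPath)
	if err != nil {
		return nil, err
	}

	transport := &http.Transport{
		TLSClientConfig: &tls.Config{
			RootCAs:      pool,
			Certificates: []tls.Certificate{cert},
		},
		TLSHandshakeTimeout: 5 * time.Second, // assumed handshake budget
	}
	// The overall request timeout should stay below the check interval,
	// otherwise slow checks start to pile up behind each other.
	return &http.Client{Transport: transport, Timeout: 5 * time.Second}, nil
}

func main() {
	client, err := newHealthClient("ca.pem", "client.pem", "client.key")
	if err != nil {
		panic(err)
	}
	ticker := time.NewTicker(20 * time.Second) // assumed check interval
	defer ticker.Stop()
	for range ticker.C {
		resp, err := client.Get("https://10.0.1.2:8853/health") // assumed health endpoint
		if err != nil {
			fmt.Println("health check failed:", err)
			continue
		}
		io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
		resp.Body.Close()
		fmt.Println("health status:", resp.StatusCode)
	}
}

The point of the sketch is that the per-request timeout has to stay well below the check interval, otherwise unanswered checks start to overlap and connections accumulate as seen above.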
On a different VM, which tries to get the health status of the VM under load, we observe too many parallel health check requests:
Every 0.1s: ss -t -a | grep 10.0.1.2
SYN-SENT 0 1 10.0.1.1:40730 10.0.1.2:8853
SYN-SENT 0 1 10.0.1.1:40434 10.0.1.2:8853
SYN-SENT 0 1 10.0.1.1:40522 10.0.1.2:8853
SYN-SENT 0 1 10.0.1.1:40392 10.0.1.2:8853
SYN-SENT 0 1 10.0.1.1:40436 10.0.1.2:8853
SYN-SENT 0 1 10.0.1.1:40724 10.0.1.2:8853
SYN-SENT 0 1 10.0.1.1:40004 10.0.1.2:8853
SYN-SENT 0 1 10.0.1.1:40718 10.0.1.2:8853
SYN-SENT 0 1 10.0.1.1:40092 10.0.1.2:8853
SYN-SENT 0 1 10.0.1.1:40716 10.0.1.2:8853
SYN-SENT 0 1 10.0.1.1:40246 10.0.1.2:8853
SYN-SENT 0 1 10.0.1.1:40720 10.0.1.2:8853
SYN-SENT 0 1 10.0.1.1:40394 10.0.1.2:8853
SYN-SENT 0 1 10.0.1.1:40792 10.0.1.2:8853
SYN-SENT 0 1 10.0.1.1:40396 10.0.1.2:8853
Our theory is that the VM with the issue is under high load and can't handle the connections. Accordingly, we observed mainly connections in CLOSE_WAIT on that VM:
Every 0.1s: ss -t -a | grep 10.0.1.1
CLOSE-WAIT 1 0 10.0.1.2:8853 10.0.1.1:57316
CLOSE-WAIT 1 0 10.0.1.2:8853 10.0.1.1:56546
CLOSE-WAIT 1 0 10.0.1.2:8853 10.0.1.1:59310
CLOSE-WAIT 1 0 10.0.1.2:8853 10.0.1.1:58100
CLOSE-WAIT 1 0 10.0.1.2:8853 10.0.1.1:57308
CLOSE-WAIT 1 0 10.0.1.2:8853 10.0.1.1:57298
CLOSE-WAIT 1 0 10.0.1.2:8853 10.0.1.1:57310
CLOSE-WAIT 1 0 10.0.1.2:8853 10.0.1.1:57314
CLOSE-WAIT 1 0 10.0.1.2:8853 10.0.1.1:57318
CLOSE-WAIT 1 0 10.0.1.2:8853 10.0.1.1:53518
CLOSE-WAIT 1 0 10.0.1.2:8853 10.0.1.1:33152
The open file handle limit for the health check process was 4096. Here we observed the too many open files error, which limits how many requests the server can process. We increased the limit to 20000, but this didn't resolve the situation: all file handles were used up again after a short period of time.
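As a side note, this is a minimal Go sketch of how a process can inspect and, with sufficient privileges, raise its own open-file limit; it is purely illustrative and not how bosh-dns manages the limit (the ctl script linked further down in this thread is where that limit appears to live):

// Hedged sketch: inspect and, with sufficient privileges, raise the process's
// open-file-descriptor limit. Illustrative only; bosh-dns does not do this in Go.
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
	fmt.Printf("open file limit: soft=%d hard=%d\n", rl.Cur, rl.Max)

	// Raise the soft limit up to the hard limit; raising the hard limit
	// itself requires root / CAP_SYS_RESOURCE.
	rl.Cur = rl.Max
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		fmt.Println("could not raise limit:", err)
	}
}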
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/168934149
The labels on this github issue will be updated when the story is started.
Hey @beyhan @rkoster @videlov,
We committed a potential fix. Can you try it and see if it alleviates the symptoms you're seeing?
Thanks, @christarazi @jaresty
With the fix we see that file handle usage drops. We are not sure whether it will solve our issue completely. Further improvements could be:
Handling /health request errors could make sense here - https://github.com/cloudfoundry/bosh-dns-release/blob/7ccbf00958d6ca50060495722d7ae29d78ed41d6/src/bosh-dns/dns/server/healthiness/health_checker.go#L34
cc @beyhan
Configurable ulimits could also make sense - https://github.com/cloudfoundry/bosh-dns-release/blob/c2eb9e317e445adc65b8159b14c16fbbd9a4f73a/jobs/bosh-dns/templates/bosh_dns_health_ctl.erb#L44
Hi @christarazi @jaresty, we are going to test your changes and report back to you.
When do you plan to cut a new release for this repo? That would remove the burden of maintaining a fork for us.
Thanks, @BeckerMax @videlov
If the fix seems good to you then we can cut a release. Please let us know how it goes. Thanks!
Hi @jaresty,
We validated the provided fix and it doesn't fix the issue. We don't see any change in the behavior of the health endpoint. Our understanding of why we get into this situation is as follows: the health endpoint can't respond to requests for some time (e.g. because of an update); by the time the health endpoint is responsive again, the queue of requests is so long that it can't work through it, because the CPU overhead of an mTLS handshake is quite high. A solution could be to avoid multiple parallel check requests against the same target health server from the same client VM. The idea is to have a single active request to a health endpoint, not multiple in parallel, because the requests are quite expensive with mTLS (a sketch of this idea follows below). Does this make sense?
cc @videlov
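To make the proposal concrete, here is a minimal Go sketch of collapsing concurrent checks against the same target into a single in-flight request, using golang.org/x/sync/singleflight. The endpoint, type names, and addresses are assumptions for the sketch, not bosh-dns code:

// Hedged sketch of the "one in-flight health request per target" idea using
// golang.org/x/sync/singleflight. Names and endpoint are assumptions, not bosh-dns code.
package main

import (
	"fmt"
	"net/http"
	"sync"

	"golang.org/x/sync/singleflight"
)

type dedupingChecker struct {
	group  singleflight.Group
	client *http.Client
}

// Check collapses concurrent checks for the same address into one HTTP request;
// every concurrent caller receives the shared result.
func (c *dedupingChecker) Check(addr string) (int, error) {
	v, err, _ := c.group.Do(addr, func() (interface{}, error) {
		resp, err := c.client.Get("https://" + addr + "/health") // assumed endpoint
		if err != nil {
			return 0, err
		}
		defer resp.Body.Close()
		return resp.StatusCode, nil
	})
	if err != nil {
		return 0, err
	}
	return v.(int), nil
}

func main() {
	checker := &dedupingChecker{client: http.DefaultClient}
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ { // ten concurrent callers, at most one outbound request
		wg.Add(1)
		go func() {
			defer wg.Done()
			status, err := checker.Check("10.0.1.2:8853")
			fmt.Println(status, err)
		}()
	}
	wg.Wait()
}

With something like this in place, ten concurrent lookups against an unhealthy instance trigger at most one mTLS handshake instead of ten.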
Hi @beyhan -
Thanks for checking this out for us. Your suggestion makes sense based on the behavior you described above. I am a bit surprised that it is possible for the requests to stack up in that way given that the timeout (5s) is less than the check interval (20s). I would have expected these health check requests to timeout instead of stacking up as described. Can you share a bit more about your investigation so we can understand how to reproduce it locally?
Thanks, Josh
Steps to reproduce the issue on a machine with up to 4 CPUs and 6 GB RAM:
Create a bosh-lite with the steps from https://bosh.io/docs/bosh-lite/
Deploy zookeeper as described in https://bosh.io/docs/bosh-lite/#deploy, but with 10 instances instead of 5.
On one zookeeper instance run iptables -A INPUT -p tcp --destination-port 8853 -j DROP to simulate an update.
Wait until the health status of that instance becomes unknown. You may check from another zookeeper instance with: /var/vcap/jobs/bosh-dns/bin/cli instances
Generate DNS load from the other instances:
bosh is -d zookeeper | grep -v <instance-id> | grep -v smoke-tests | awk '{print $1}' | xargs -I {} -P 9 bosh -d zookeeper ssh {} -c "for i in {1..100}; do printf 'dig q-s3-i<instance-bosh-index>.zookeeper.*.*.bosh; echo %s;\\n' {0..200} | xargs -P 200 -I % sh -c '%'; done"
Remove the iptables rule again:
iptables -D INPUT -p tcp --destination-port 8853 -j DROP
Check that the instance gets too many requests and reaches the file handle limit:
tail -f /var/vcap/sys/log/bosh-dns/bosh_dns_health.std*
will show many errors, e.g.:
2019/10/28 12:32:58 http: Accept error: accept tcp 0.0.0.0:8853: accept4: too many open files; retrying in 5ms
2019/10/28 12:33:19 http: TLS handshake error from 10.244.0.4:47990: EOF
lsof -p <dns-health-process-pid> | wc -l
netstat -ant | awk '{print $6}' | sort | uniq -c | sort -n
The instance health status should stay in the unknown state, and the dig DNS queries should not be able to resolve this instance.
cc @beyhan
Thanks @videlov for the detailed reproduction steps. Using these steps I was indeed able to reproduce the error described. I've submitted a fix which seems to resolve the issue based on the reproduction steps you shared. Can you try it out and let us know how it goes?
Hi Jaresty, to which version / tile is the fix applied?
Thanks.
Kelvin
I can confirm that the provided fix reduces the request load and works fine. Clients now make one request towards unhealthy instances, or four in parallel when a retry happens.
We now observe higher CPU load with DNS version v1.15.0 plus the fix for request batching, compared to the old version 1.12.0 which we had. (Unfortunately we also updated the DNS version, which makes isolating the performance regression more complicated; that is why I'm adding the comment here, but we can open a different issue for it.) By higher CPU load I mean the following:
The dns-health process in version v1.15.0 plus the request batching fix uses 30% to 40% CPU for 300 to 400 incoming req/s.
The dns-health process in version v1.12.0 uses 10% to 20% CPU for 300 to 400 incoming req/s.
Can you think of any change which could cause this? Do you also see any change in the CPU usage with v1.15.0 compared to older versions?
I have opened issue https://github.com/cloudfoundry/bosh-dns-release/issues/52 for the performance regression in the health endpoint. I verified that the performance issue is not related to the fix provided here for request batching and was introduced with version v1.15.0.
@beyhan From my understanding, the fix provided above addresses this issue by batching requests to the same health endpoint. Do we still see further improvement options, like different retry patterns or using connections with keep_alive to avoid expensive certificate evaluation? @jaresty What do you think about the latter?
Reusing connections would reduce the CPU load a lot because, as you can see in this issue, the health server spends a lot of time on handshakes (a hedged sketch of connection reuse follows this comment).
I think we don't need retries, because the check interval is quite short (20s). Avoiding retries will only help in situations where the health server is overloaded; it doesn't help in a normal situation where the server can handle the load.
I don't see potential for other improvements at the moment.
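To illustrate the keep-alive idea, a hedged Go sketch of an HTTP client configured to reuse connections; the transport values and endpoint are assumptions, not bosh-dns settings:

// Hedged sketch: reuse TLS connections between health checks so the expensive
// mTLS handshake happens once per peer rather than once per check.
// Transport settings and values are illustrative assumptions.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func newReusingClient(base *http.Transport) *http.Client {
	t := base.Clone()
	t.DisableKeepAlives = false          // allow connection reuse (the default)
	t.MaxIdleConns = 200                 // total idle connections kept in the pool
	t.MaxIdleConnsPerHost = 2            // idle connections kept per health endpoint
	t.IdleConnTimeout = 90 * time.Second // eventually drop idle connections
	return &http.Client{Transport: t, Timeout: 5 * time.Second}
}

// check drains and closes the response body; if the body is not fully read,
// the connection cannot go back into the pool and the next check pays a new handshake.
func check(client *http.Client, url string) (int, error) {
	resp, err := client.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body)
	return resp.StatusCode, nil
}

func main() {
	client := newReusingClient(&http.Transport{TLSHandshakeTimeout: 5 * time.Second})
	status, err := check(client, "https://10.0.1.2:8853/health") // assumed endpoint
	fmt.Println(status, err)
}

The crucial details are leaving keep-alives enabled and fully draining response bodies; otherwise each check opens a fresh connection and pays the full handshake.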
I've made several changes that were touched on in this thread: https://github.com/cloudfoundry/bosh-dns-release/compare/383808ccc7f5e5fe72fc44fe936ed14961d6a15e...de827faf9736a5ecd43810636234e06025946359
health.request_timeout, health.local_health_interval, health.remote_health_interval, health.synchronous_check_timeout are now configurable in the job spec. (https://www.pivotaltracker.com/story/show/170479480)
DNS lookups will no longer trigger a health request if cached health information is already available; it will be refreshed on the next remote_health_interval. (https://www.pivotaltracker.com/story/show/170506259) A hypothetical sketch of this caching behaviour follows this list.
Removed the network retries on the health requests made each remote_health_interval. (https://www.pivotaltracker.com/story/show/170158998)
Health responses are now logged at DEBUG level. (https://www.pivotaltracker.com/story/show/170477655)
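Purely to illustrate the lookup-versus-refresh separation in the second item above (all names, types, and intervals here are assumptions, not the release's implementation):

// Hypothetical sketch of the caching behaviour described above: DNS lookups read
// cached health state, and a background loop refreshes it on a fixed interval.
// An illustration of the idea only, not bosh-dns's implementation.
package main

import (
	"fmt"
	"sync"
	"time"
)

type healthCache struct {
	mu    sync.RWMutex
	state map[string]string // instance address -> "running" / "failing" / "unknown"
}

// Status is what a DNS lookup consults; it never triggers a network call.
func (c *healthCache) Status(addr string) string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if s, ok := c.state[addr]; ok {
		return s
	}
	return "unknown"
}

// refreshLoop polls each tracked instance once per interval, independently of
// how many DNS queries arrive in between.
func (c *healthCache) refreshLoop(interval time.Duration, check func(string) string, addrs []string) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		for _, addr := range addrs {
			s := check(addr)
			c.mu.Lock()
			c.state[addr] = s
			c.mu.Unlock()
		}
	}
}

func main() {
	cache := &healthCache{state: map[string]string{}}
	fakeCheck := func(addr string) string { return "running" } // stand-in for the HTTP health call
	go cache.refreshLoop(time.Second, fakeCheck, []string{"10.0.1.2:8853"})
	time.Sleep(2 * time.Second) // let at least one refresh tick happen
	fmt.Println(cache.Status("10.0.1.2:8853"))
}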
In my small scale testing (12 n1-standard-4 VMs spamming each other with 250 instance group lookups) overall CPU usage dropped from 150% to 30% compared to 1.16.
I didn't look at file handle/ulimit changes, but now that health checks scale with the number of domains instead of the number of requests, that bottleneck is hopefully relieved.
Build candidate https://s3.amazonaws.com/bosh-dns-release-candidate-tarballs/bosh-dns-dev-release.tgz?versionId=ihp_vYdf999PgoR2xiSAjyNvQVLqrD40
Thanks! We will give it a try!
Delivered in v1.17.0
Symptoms: On our big landscapes (1000 instances), when rotating dns_healthcheck certificates (step 1 of the guide https://bosh.io/docs/dns/#rotating-dns-certificates), the api instances of cf go up to and stay at 100% CPU usage.
The CPU is used by the bosh-dns-healthcheck process.
Output from bosh_dns_health.stderr.log on the high-CPU VM:
Analysis: It seems that other VMs try to contact the api instances to check their health. For some reason that fails with a TLS handshake error. The other VMs don't wait but retry immediately; client 10.0.65.98, for example, tries to connect 3 times in the same second. This is all strange, because step 1 only concatenates CAs, so we would expect the TLS connections to succeed.
Workaround: The VMs stay at the high CPU usage for several hours. After half a day the usage slowly goes down by itself. Alternatively, restarting the bosh-dns-healthcheck process solves it.
Question: Are you aware of such an issue, and how could we solve / prevent this from happening in the future?
Coding: Should we log errors when the health check requests fail? (here)
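For illustration, a hedged sketch of what client-side logging of failed health check requests could look like; the type, method name, and logger here are placeholders rather than the actual signatures in health_checker.go:

// Hypothetical sketch: log health-check request failures on the client side.
// Type, method name, and logger are placeholders, not bosh-dns's API.
package main

import (
	"log"
	"net/http"
)

type healthChecker struct {
	client *http.Client
	logger *log.Logger
}

func (h *healthChecker) GetStatus(url string) bool {
	resp, err := h.client.Get(url)
	if err != nil {
		// Logging the error here would make storms of failed TLS handshakes,
		// like the one described above, much easier to diagnose.
		h.logger.Printf("health check request to %s failed: %s", url, err)
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	h := &healthChecker{client: http.DefaultClient, logger: log.Default()}
	_ = h.GetStatus("https://10.0.1.2:8853/health") // assumed endpoint
}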