Router VM starts to be very slow when gorouter generates a lot of logs

giner commented 7 years ago

Issue

Router VM starts to be very slow when gorouter generates a lot of logs. This can lead to a situation when a single broken application can make routing layer fail.

Context

The situation when gorouter generates a lot of logs.

Steps to Reproduce

Deploy an application which does not accept requests fast (sleep 5 seconds before accepting a tcp connection).
Run load test against this app

Expected result

Load on a router VM slightly increases

Current result

Load on a router VM highly increases

Possible Fix

This happens because of the following part of bash code https://github.com/cloudfoundry-incubator/routing-release/blob/0.152.0/src/routing_utils/syslog_utils.sh#L15-L16. read makes a syscall for every single character and date is run for every log line.

To solve the issue prepend_datetime should be replaced by something like ts, awk or anything else which is faster than Bash's read.

cf-gitbot commented 7 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/144602211

The labels on this github issue will be updated when the story is started.

shashwathi commented 7 years ago

@giner : Thank you for submitting this issue. Over the time, we have copied the utils script from other teams and not really wondered whether it was efficient bash scripting or not. We will definitely take a look at the alternatives that are suggested in this issue to reduce CPU cycles.

@abbyachau @shalako : Feel free to prioritize this.

Regards Shash

CAFxX commented 7 years ago

If ts is available the whole prepend_datetime function can be replaced by ts "%Y-%m-%d %H:%M:%S%z" (I'm assuming the sed is required just to undo the effect of read -r, but I'm not 100% positive about this)

giner commented 7 years ago

The following works for me:

[lab-openstack router_dev_z1/0 10.73.3.3] rpaas-admin:~
$ diff -u /var/vcap/packages/routing_utils/syslog_utils.sh{.orig,}
--- /var/vcap/packages/routing_utils/syslog_utils.sh.orig       2016-11-03 18:35:43.000000000 +0000
+++ /var/vcap/packages/routing_utils/syslog_utils.sh    2017-07-15 21:42:11.198945353 +0000
@@ -7,12 +7,6 @@
   declare log_dir="$1"
   declare log_name="$2"

-  exec > >(tee -a >(logger -p user.info -t "vcap.${log_name}.stdout") | prepend_datetime >>"${log_dir}/${log_name}.log")
-  exec 2> >(tee -a >(logger -p user.error -t "vcap.${log_name}.stderr") | prepend_datetime >>"${log_dir}/${log_name}.err.log")
-}
-
-function prepend_datetime() {
-  while read -r line; do
-    echo "[`date +\"%Y-%m-%d %H:%M:%S%z\"`] $line" | sed s/\\\\n//
-  done
+  exec > >(tee -a >(logger -p user.info -t "vcap.${log_name}.stdout") | ts "[%Y-%m-%d %H:%M:%S%z]" >>"${log_dir}/${log_name}.log")
+  exec 2> >(tee -a >(logger -p user.error -t "vcap.${log_name}.stderr") | ts "[%Y-%m-%d %H:%M:%S%z]" >>"${log_dir}/${log_name}.err.log")
 }

shalako commented 7 years ago

@giner @CAFxX

Can you quantify your performance observations before and after the proposed change? What was your load profile (concurrency, req/resp size)? How many routes in the routing table?

Thank you for the suggestion. Prioritizing for implementation.

giner commented 7 years ago

There is now easy way to quantify, it just goes out of control, load average goes higher than number of cores (while CPU utilization may barely reach 50%) and the whole system becomes unstable. But this only happens when gorouter generates lots of logs to stdout / stderr. And this can be triggered by misbehaving application.

CAFxX commented 7 years ago

Just to give some context:

when the behavior @giner is talking about happens the cpu load mostly is generated in the bash process that runs that loop, not form gorouter/metron or any other process running on the vm
we have seen this mostly in our non-prod router "segments" (we haven't migrated to isolation segments yet, we use ACLs to define segments) whose configuration is the same as prod segments but with significantly less load
important note: we don't emit access logs via stdout (only via metron), so basically the only logs that gorouter emits are error logs (see below as to why this is important)
I don't have hard numbers about number of req/sec last time we saw this, but the number of routes was somewhere in the range 600-700. although we don't have absolute numbers about the number of req/sec, what we noticed IIRC is that there was no big increase w.r.t. normal traffic levels; what was unusual was that one of the apps was accepting connections and requests but never responding, raising the number of open connections (until they timed out). This broken application was what was causing the router to generate a lot of error logs.

shalako commented 7 years ago

Hello @giner @CAFxX

Thank you for the additional context. It sounds like you had an unresponsive app that resulted in many concurrent connections and generated more errors in the gorouter.log

In our latest release we now offer the manifest property router.backends.max_conns with which you can limit the concurrent request per backend. This may help with the scale of the behavior you observed.

With regard to the performance of the script that prepends the timestamp, we have concerns about using ts.

It has a GPL license
As it doesn't come with the stemcell, we'd have to explore how to be notified promptly of CVEs

We could likely find a solution to 2 but 1 is a problem. We'd have to get approval from the CFF legal team. I will also inquire as to whether the stemcell team would consider adding ts. In the past, they have been resistant to add packages that do not come with the standard distribution for reason 2.

It has also been suggested to me that we could improve performance even more if we stopped prepending the timestamp entirely, and the gorouter may be among a few components in CF that does this. I was surprised to hear it; I thought CC and others do the same. While I can do some research to determine if there is a convention and what guidance others on the project would provide, I am skeptical about this strategy because:

I believe the human-readable timestamp to be more valuable
Removing the timestamp may be a breaking change for aggregation services that expect it.

giner commented 7 years ago

What about one of the following?

while true; do echo $a; let a++; sleep 0.1; done |
    gawk '{ print strftime("[%Y-%m-%d %H:%M:%S%z]"), $0; fflush(); }'

while true; do echo $a; let a++; sleep 0.1; done |
    perl -ne 'BEGIN { use POSIX strftime; STDOUT->autoflush(1) }; my $time = strftime("[%Y-%m-%d %H:%M:%S%z]", localtime); print("$time $_")'

CAFxX commented 7 years ago

As @giner says, if we really can't use ts (more on that below), at least let's opt for a solution that doesn't fork 3 processes per log line (as the current fix does)

With regard to the performance of the script that prepends the timestamp, we have concerns about using ts.

It has a GPL license

As it doesn't come with the stemcell, we'd have to explore how to be notified promptly of CVEs

We could likely find a solution to 2 but 1 is a problem. We'd have to get approval from the CFF legal team. I will also inquire as to whether the stemcell team would consider adding ts. In the past, they have been resistant to add packages that do not come with the standard distribution for reason 2.

I may be missing something... 2 seems to me be trivially fixable by including the ubuntu moreutils package (that includes ts) in the stemcell and about 1 I think the stemcell already includes plenty of GPL components, so I can't really imagine why adding an additional GPL component would be much of a problem.

shalako commented 7 years ago

Thank you for all your feedback. I've checked with the BOSH team and they are resistant to pulling in the moreutils package.

We have received guidance that all component teams should stop using the tee-to-logger pattern and that operators should colocate syslog-release to handle streaming of syslog. For more info, see https://github.com/cloudfoundry/exemplar-release#exporting-logs

If you can suggest a cheap solution in the short term that does not introduce additional dependencies, great. Otherwise, I suggest we move in the direction of removing support for syslog forwarding in favor of syslog-release. Maybe a property that operators can opt into that disables our syslog behavior?

emalm commented 7 years ago

@shalako Now that the gorouter is using the github.com/uber-go/zap library for logging, if you want more human-readable timestamps in the structured logs you could configure it to use an ISO8601TimeEncoder.

giner commented 7 years ago

If you can suggest a cheap solution in the short term that does not introduce additional dependencies, great.

gawk and perl examples above

giner commented 7 years ago

I have realized that stemcell doesn't have GNU awk either. But it still has perl so that the following would work (TESTED):

$ diff -u /var/vcap/packages/routing_utils/syslog_utils.sh{.orig,}
--- /var/vcap/packages/routing_utils/syslog_utils.sh.orig       2017-09-25 19:50:47.196115412 +0000
+++ /var/vcap/packages/routing_utils/syslog_utils.sh    2017-09-25 18:55:31.914603974 +0000
@@ -12,7 +12,5 @@
 }

 function prepend_datetime() {
-  while read -r line; do
-    echo "[`date +\"%Y-%m-%d %H:%M:%S%z\"`] $line" | sed s/\\\\n//
-  done
+  perl -ne 'BEGIN { use POSIX strftime; STDOUT->autoflush(1) }; my $time = strftime("[%Y-%m-%d %H:%M:%S%z]", localtime); print("$time $_")'
 }

giner commented 7 years ago

Here is one more case when this issue becomes a big problem. If a route which receives high number of requests disappears, load on router VMs goes very high until the route is back. For example this is the case when someone updates or restages an app in place without doing blue-green. Also this is an easy vector for DOS attack.

Here is an example:

# Running siege to simulate 300 concurrent connections against non-existent route for 5 minutes
siege -t 5m -c 300 http://test.dev.mycf.net

gorouter_load

Notes:

there are 3 routers in total
router_dev_z1/0 has a patch applied (perl-based).

shalako commented 7 years ago

Thank you for doing this exploration. Prioritizing for eng review.

jberkhahn commented 7 years ago

Thanks for the update, this looks good. We added it to routing-release.

Thanks, @jberkhahn & @flawedmatrix CF Routing Team

cloudfoundry / routing-release