airbnb / nerve

A service registration daemon that performs health checks; companion to airbnb/synapse
MIT License

Rate limit reporter updates #117

Closed panchr closed 5 years ago

panchr commented 5 years ago

Problem

When a service is flappy (i.e. its status changes frequently between healthy and unhealthy), it generates high write throughput to the reporter, because each status change is reported.

Solution

Throttle reporter updates with rate limiting. The average rate and maximum burst are configurable, and rate limiting is off by default.
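A limiter with an average rate and a maximum burst is commonly built as a token bucket. The following is a minimal sketch of that technique, not the RateLimiter class from this PR: the class name `TokenBucket`, the `consume` method, the keyword arguments, and the injectable `clock` are all assumptions made for illustration.

```ruby
# Hypothetical token-bucket limiter. Names and parameters are illustrative
# and are not taken from nerve's RateLimiter.
class TokenBucket
  # average_rate: tokens added per second; max_burst: bucket capacity.
  # clock: callable returning the current time in seconds (injectable for tests).
  def initialize(average_rate:, max_burst:,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @average_rate = average_rate.to_f
    @max_burst = max_burst.to_f
    @clock = clock
    @tokens = @max_burst          # start full so an initial burst is allowed
    @last_refill = @clock.call
  end

  # Returns true and spends one token if an update is allowed right now;
  # returns false (throttled) otherwise.
  def consume
    refill
    return false if @tokens < 1.0

    @tokens -= 1.0
    true
  end

  private

  # Adds tokens for the time elapsed since the last call, capped at max_burst.
  def refill
    now = @clock.call
    elapsed = now - @last_refill
    @tokens = [@max_burst, @tokens + elapsed * @average_rate].min
    @last_refill = now
  end
end

# Usage: 1 update per 10 seconds with no bursting, driven by a fake clock.
fake_now = 0.0
limiter = TokenBucket.new(average_rate: 0.1, max_burst: 1, clock: -> { fake_now })
limiter.consume   # allowed: the bucket starts full
limiter.consume   # throttled: no tokens left
fake_now += 10.0
limiter.consume   # allowed again once 10 seconds have elapsed
```

Injecting the clock keeps the sketch deterministic under test; a real daemon would rely on the monotonic default.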

Testing

Added unit tests for both the RateLimiter class and its integration into ServiceWatcher. The tests cover the average rate and maximum burst, and verify that reports are not throttled when the rate limiter is disabled.

Also tested using mango-test:

Rate-limiting enabled

For this test, I configured rate-limiting to allow only 1 report per 10 seconds. As the log shows, the new status is reported once reports are no longer throttled (i.e. 10 seconds after the previous report).

----> I, [2019-09-25T21:22:27.085264 #4004]  INFO -- Nerve::ServiceWatcher: nerve: service mango-test_26009_secure is now up
W, [2019-09-25T21:22:31.131430 #4004]  WARN -- Nerve::ServiceCheck::HttpServiceCheck: nerve: check http-127.0.0.1:80/health got response code 502 with body "<html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx</center>
</body>
</html>
"
...
I, [2019-09-25T21:22:31.633595 #4004]  INFO -- Nerve::ServiceCheck::HttpServiceCheck: nerve: service check http-127.0.0.1:80/health transitions to down after 2 failures
W, [2019-09-25T21:22:31.633696 #4004]  WARN -- Nerve::ServiceWatcher: nerve: service mango-test_26009_secure throttled (rate limiter enabled: true)
...
----> W, [2019-09-25T21:22:37.159856 #4004]  WARN -- Nerve::ServiceWatcher: nerve: service mango-test_26009_secure is now down
I, [2019-09-25T21:22:39.225242 #4004]  INFO -- Nerve::ServiceCheck::HttpServiceCheck: nerve: service check http-127.0.0.1:80/health transitions to up after 2 successes
W, [2019-09-25T21:22:39.225375 #4004]  WARN -- Nerve::ServiceWatcher: nerve: service mango-test_26009_secure throttled (rate limiter enabled: true)
W, [2019-09-25T21:22:39.731685 #4004]  WARN -- Nerve::ServiceWatcher: nerve: service mango-test_26009_secure throttled (rate limiter enabled: true)
...
----> I, [2019-09-25T21:22:47.349001 #4004]  INFO -- Nerve::ServiceWatcher: nerve: service mango-test_26009_secure is now up

From metrics, the number of reports is far lower than the number of attempted reports (which were throttled):

[Images: two metrics graphs]

Rate-limiting disabled

With rate-limiting disabled, logs and metrics are still emitted whenever a report would have been throttled, as shown in the log below. However, the report still goes through.

I, [2019-09-26T18:03:47.677764 #19718]  INFO -- Nerve::ServiceCheck::HttpServiceCheck: nerve: service check http-127.0.0.1:80/health transitions to up after 2 successes
----> W, [2019-09-26T18:03:47.677878 #19718]  WARN -- Nerve::ServiceWatcher: nerve: service mango-test_26009_secure throttled (rate limiter enabled: false)
----> I, [2019-09-26T18:03:47.681427 #19718]  INFO -- Nerve::ServiceWatcher: nerve: service mango-test_26009_secure is now up
W, [2019-09-26T18:03:53.693957 #19718]  WARN -- Nerve::ServiceCheck::HttpServiceCheck: nerve: check http-127.0.0.1:80/health got response code 502 with body "<html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx</center>
</body>
</html>
"
I, [2019-09-26T18:03:53.694049 #19718]  INFO -- Nerve::ServiceCheck::HttpServiceCheck: nerve: service check http-127.0.0.1:80/health transitions to down after 2 failures
W, [2019-09-26T18:03:53.696531 #19718]  WARN -- Nerve::ServiceWatcher: nerve: service mango-test_26009_secure is now down
...
I, [2019-09-26T18:04:01.764030 #19718]  INFO -- Nerve::ServiceCheck::HttpServiceCheck: nerve: service check http-127.0.0.1:80/health transitions to up after 2 successes
----> W, [2019-09-26T18:04:01.764150 #19718]  WARN -- Nerve::ServiceWatcher: nerve: service mango-test_26009_secure throttled (rate limiter enabled: false)
----> I, [2019-09-26T18:04:01.768049 #19718]  INFO -- Nerve::ServiceWatcher: nerve: service mango-test_26009_secure is now up

With rate-limiting disabled, the number of reports (for up+down) is equal to the number of "throttles" (because no report is actually blocked):

[Images: two metrics graphs]
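The difference between the two modes can be sketched as follows. This is an illustration under stated assumptions, not the ServiceWatcher code from this PR: `SketchWatcher`, its `report` method, the `allow` callable, and the counters are hypothetical names. The point it shows is that the throttle is logged and counted in both modes, while the report is only skipped when the limiter is enabled.

```ruby
# Hypothetical sketch. Class, method, and counter names are illustrative
# and are not taken from nerve's ServiceWatcher.
class SketchWatcher
  attr_reader :reports, :throttles, :log

  # enabled: whether throttling is enforced.
  # allow: callable returning true when the limiter would permit a report.
  def initialize(enabled:, allow:)
    @enabled = enabled
    @allow = allow
    @reports = 0
    @throttles = 0
    @log = []
  end

  # Returns true if the status was reported, false if it was dropped.
  def report(status)
    unless @allow.call
      # Logged and counted in both modes, so the effect of enabling the
      # limiter can be estimated before turning it on.
      @throttles += 1
      @log << "throttled (rate limiter enabled: #{@enabled})"
      return false if @enabled
    end

    @reports += 1
    @log << "is now #{status}"
    true
  end
end

# Usage with a limiter that always says "no":
enforcing = SketchWatcher.new(enabled: true, allow: -> { false })
observing = SketchWatcher.new(enabled: false, allow: -> { false })
%w[up down up].each do |status|
  enforcing.report(status)
  observing.report(status)
end
# enforcing drops every report; observing sends every report while still
# counting each one as a throttle.
```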

Reviewers

@anson627 @Jason-Jian @austin-zhu