Problem
When a service is flappy (i.e., it changes between healthy and unhealthy frequently), it generates high write throughput to the reporter, because every status change is reported.
Solution
Use rate limiting to throttle reporter updates. Both the average rate and the maximum burst are configurable, and rate limiting is off by default.
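As an illustration only, a minimal token-bucket sketch with a configurable average rate and maximum burst, disabled by default, might look like the following; the names and interface are assumptions, not necessarily those of the RateLimiter class added in this change.

    # Hypothetical token-bucket sketch; illustrative only.
    class RateLimiter
      def initialize(average_rate:, max_burst:, enabled: false)
        @enabled = enabled            # rate limiting is off by default
        @average_rate = average_rate  # tokens refilled per second
        @max_burst = max_burst        # bucket capacity
        @tokens = max_burst.to_f
        @last_refill = Time.now
      end

      def enabled?
        @enabled
      end

      # True if an update is within the allowed rate. The bucket is updated even
      # when the limiter is disabled, so callers can still log and emit a metric
      # for would-be throttles without actually blocking the report.
      def allow?
        now = Time.now
        @tokens = [@tokens + (now - @last_refill) * @average_rate, @max_burst].min
        @last_refill = now
        return false if @tokens < 1
        @tokens -= 1
        true
      end
    end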
Testing
Added unit tests for both the RateLimiter class and its integration into ServiceWatcher. The tests cover the average rate and the maximum burst, and verify that reports are not throttled when the rate limiter is disabled.
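For flavor, a spec along these lines (written against the hypothetical interface sketched above, so the real specs will differ) exercises the burst behavior:

    # Hypothetical RSpec sketch; the actual specs in this change may differ.
    describe RateLimiter do
      it 'allows up to the maximum burst, then throttles' do
        limiter = RateLimiter.new(average_rate: 0.1, max_burst: 2, enabled: true)
        expect(limiter.allow?).to eq(true)
        expect(limiter.allow?).to eq(true)
        expect(limiter.allow?).to eq(false)  # burst exhausted
      end
    end

With the sketch above, the disabled case is best checked at the ServiceWatcher level, since it is the watcher rather than the limiter that decides whether a would-be throttle actually blocks the report.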
Also tested using mango-test:
Rate-limiting enabled
For this test, I configured rate-limiting to allow only 1 report per 10 seconds. As shown in the log below, once reports are no longer throttled (i.e. 10 seconds after the previous report), the new status is reported.
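In terms of the sketch above, that setting corresponds roughly to an average rate of 0.1 reports per second with a maximum burst of 1 (the actual configuration keys are not shown here):

    # Roughly "1 report per 10 seconds" with the hypothetical sketch above.
    limiter = RateLimiter.new(average_rate: 0.1, max_burst: 1, enabled: true)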
----> I, [2019-09-25T21:22:27.085264 #4004] INFO -- Nerve::ServiceWatcher: nerve: service mango-test_26009_secure is now up
W, [2019-09-25T21:22:31.131430 #4004] WARN -- Nerve::ServiceCheck::HttpServiceCheck: nerve: check http-127.0.0.1:80/health got response code 502 with body "<html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx</center>
</body>
</html>
"
...
I, [2019-09-25T21:22:31.633595 #4004] INFO -- Nerve::ServiceCheck::HttpServiceCheck: nerve: service check http-127.0.0.1:80/health transitions to down after 2 failures
W, [2019-09-25T21:22:31.633696 #4004] WARN -- Nerve::ServiceWatcher: nerve: service mango-test_26009_secure throttled (rate limiter enabled: true)
...
----> W, [2019-09-25T21:22:37.159856 #4004] WARN -- Nerve::ServiceWatcher: nerve: service mango-test_26009_secure is now down
I, [2019-09-25T21:22:39.225242 #4004] INFO -- Nerve::ServiceCheck::HttpServiceCheck: nerve: service check http-127.0.0.1:80/health transitions to up after 2 successes
W, [2019-09-25T21:22:39.225375 #4004] WARN -- Nerve::ServiceWatcher: nerve: service mango-test_26009_secure throttled (rate limiter enabled: true)
W, [2019-09-25T21:22:39.731685 #4004] WARN -- Nerve::ServiceWatcher: nerve: service mango-test_26009_secure throttled (rate limiter enabled: true)
...
----> I, [2019-09-25T21:22:47.349001 #4004] INFO -- Nerve::ServiceWatcher: nerve: service mango-test_26009_secure is now up
From the metrics, the number of reports is far lower than the number of attempted reports (which were throttled):
Rate-limiting disabled
With rate-limiting disabled, the throttle log line and metric are still emitted whenever a report would have been throttled, as the log below shows; however, the report itself still goes through (a sketch of this behavior follows the log).
I, [2019-09-26T18:03:47.677764 #19718] INFO -- Nerve::ServiceCheck::HttpServiceCheck: nerve: service check http-127.0.0.1:80/health transitions to up after 2 successes
----> W, [2019-09-26T18:03:47.677878 #19718] WARN -- Nerve::ServiceWatcher: nerve: service mango-test_26009_secure throttled (rate limiter enabled: false)
----> I, [2019-09-26T18:03:47.681427 #19718] INFO -- Nerve::ServiceWatcher: nerve: service mango-test_26009_secure is now up
W, [2019-09-26T18:03:53.693957 #19718] WARN -- Nerve::ServiceCheck::HttpServiceCheck: nerve: check http-127.0.0.1:80/health got response code 502 with body "<html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx</center>
</body>
</html>
"
I, [2019-09-26T18:03:53.694049 #19718] INFO -- Nerve::ServiceCheck::HttpServiceCheck: nerve: service check http-127.0.0.1:80/health transitions to down after 2 failures
W, [2019-09-26T18:03:53.696531 #19718] WARN -- Nerve::ServiceWatcher: nerve: service mango-test_26009_secure is now down
...
I, [2019-09-26T18:04:01.764030 #19718] INFO -- Nerve::ServiceCheck::HttpServiceCheck: nerve: service check http-127.0.0.1:80/health transitions to up after 2 successes
----> W, [2019-09-26T18:04:01.764150 #19718] WARN -- Nerve::ServiceWatcher: nerve: service mango-test_26009_secure throttled (rate limiter enabled: false)
----> I, [2019-09-26T18:04:01.768049 #19718] INFO -- Nerve::ServiceWatcher: nerve: service mango-test_26009_secure is now up
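A minimal sketch of how this disabled path can behave, assuming the hypothetical RateLimiter above and placeholder logging, metric, and reporter calls (not necessarily how ServiceWatcher is actually wired up in this change):

    # Hypothetical reporting decision; log, statsd, and @reporter are placeholders.
    def report_status(service_name, status)
      unless @rate_limiter.allow?
        log.warn "nerve: service #{service_name} throttled " \
                 "(rate limiter enabled: #{@rate_limiter.enabled?})"
        statsd.increment('nerve.reporter.throttled')  # hypothetical metric name
        return if @rate_limiter.enabled?              # only block when enabled
      end
      @reporter.update(status)                        # placeholder for the real reporter call
    end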
With rate-limiting disabled, the number of reports (up + down combined) is equal to the number of "throttle" events, because no report is actually blocked:
Reviewers
@anson627 @Jason-Jian @austin-zhu