Kong / lua-resty-dns-client

Lua DNS client, load balancer, and utility library
http://kong.github.io/lua-resty-dns-client/topics/README.md.html
Apache License 2.0
152 stars 52 forks source link

feat(balancer) prevent cascading failures #65

Closed Tieske closed 5 years ago

Tieske commented 5 years ago

Implements a health measure. The relative "weight" in the balancer being disabled.

The new property healthThreshold can be used to set the minimum amount of healthy weight required (in %). When the actual percentage gets under the threshold, the balancer will return errors.

The default value for healthThreshold == 0 hence default behaviour does not change. It is a breaking change though because the error message returned will be different, and the callback got a new event "health" whenever there is an update in health status.

fixes #25

EDIT: rewrote the description since the initial description only covered the 1st commit

Tieske commented 5 years ago

@hishamhm there are some test failures, but they are unrelated. Some test dns records were deleted (all SRV related), I'm working on fixing those.

Tieske commented 5 years ago

Note that the last commit fixes linter errors, most of which existed before this pr.

Tieske commented 5 years ago

@hishamhm after discussion I tried to find slightly more intelligent back-off mechanisms. Problem is that most of those require application knowledge/client side behaviour.

To improve the following could be implemented;

  1. apply a global retry-limit on a balancer (max 10 retries per second, or 200 per minute, etc). This would always be in effect.
  2. Drop relative traffic (so 20% unhealthy means randomly dropping 20% of traffic) and stop handling retries all together. Applicable from a first threshold, eg. health drops below 70%.
  3. Drop all traffic to stop the cascade. Applicable from a second threshold, eg. health drops below 50%.

This PR implements health metrics, and option 3 above. The other ones could be added later, since the third one is the most important (and most protective).

Tieske commented 5 years ago

@locao would you mind having a look at this one as well. This is the one @hishamhm mentioned as "controversial" because of the all-or-nothing approach when reaching the threshold.

But as stated, the default value for the threshold is to kick in when everything has failed, hence it does not change default behaviour. Only when explicitly setting the threshold.