Add healthtest feature to dynamically add/remove DNS entries

abligh commented 9 years ago

This is an experimental feature to add/remove DNS entries depending on whether servers/services are up/down. Initially NTP testing and TCP port testing are supported.

This is an experimental, lightly tested patch. It's based on top of the other three merge requests I have submitted recently, but is pretty easily separated.

From the commit message:

commit 049d4acdfd1b5941feb501ddb1353a27e464047d
Author: Alex Bligh <alex@alex.org.uk>
Date:   Mon Aug 17 15:32:26 2015 +0100

Add healthtest functionality to automatically exclude down hosts

Add a healthtest function activated by setting the 'test' attribute on a label.
When a test is specified, each RR within the label (A and AAAA only at this
stage) is polled regularly with a configurable test. Current configurable
tests are that a tcp port can be opened, or that an NTP response within a
given stratum range is achieved. RRs that fail the test will be excluded
from any results.

Health tests can be configured as follows:

     "test" :  {
         "type" : "ntp",
         "frequency" : 30,
         "retry_time" : 5,
         "retries" : 3,
         "timeout" : 5,
         "max_stratum" : "3"
      }

or

     "test" :  {
         "type" : "tcp",
         "frequency" : 30,
         "retry_time" : 5,
         "retries" : 3,
         "timeout" : 5,
         "port" : 80
      }

Attributes are as follows:

* type: specifies type of test (currently "ntp" or "tcp")

* frequency: specifies time in seconds between polls if the server is up

* retry_time: specifies time in seconds between polls if a poll fails

* retries: number of failed polls required to consider a server as down

* timeout: timeout on each of the polls

* max_stratum: (ntp only) maximum ntp stratum number for the poll
  to be considered successful

* port: (tcp only) tcp port number to connect to

Signed-off-by: Alex Bligh <alex@alex.org.uk>

abligh commented 9 years ago

Rebased onto fix for #74

abh commented 9 years ago

Hi Alex,

This (and the other patches) is great, thank you! I'm a bit backlogged and need to spend my hobby time on some other stuff in the immediate future, but I'm looking forward to reviewing and integrating this!

Ask

abligh commented 9 years ago

Ask,

Great. Happy to tweak them if need be.

Alex

abligh commented 9 years ago

Rebased onto #77, #71, #72

abligh commented 9 years ago

Rebased

abh commented 9 years ago

Alright, I’ve been thinking about this for a while. I’m really happy about this work; clearly it’s something GeoDNS needs to be more generally useful.

However — I don’t think the health checking itself should be inside the DNS daemon. We can’t make health checks for every protocol, and in some deployments it doesn’t make sense to have every DNS server do health checks (or the reverse).

It’d be better to just have a file format for “health check results” that GeoDNS knows how to load and then a separate daemon to do the checks; then others could do something to get health check results from nagios/pingdom/whatever they use as appropriate.

abligh commented 9 years ago

Ask,

Having thought about it, I disagree with your last paragraph:

* You are right that we can't cover every case, but the existing tests already cover many of them, because tcp covers loads: specifically http, https, smtp (plus secure variants), with ntp covered by its own test. If we recognise for a minute that covering dns would be silly (you need anycast for that), we've covered a very large percentage of traffic on the internet, and much more would be covered by the tcp tests. Can you think of any major use cases which aren't covered?

* The tests are small (one simple test function, and something to stringify parameters). If there are any others that would be universally useful, I'd be happy to write them, and others could easily do so too. I could put the tests in separate files and write some documentation on how to add one if useful. The only one I've thought of so far is an SSL test that actually carries out the negotiation. In practice, though, it's uncommon for an SSL service to fail in a manner where it still responds on the port, and such a test just loads up the server. It would be all of 5 minutes' work to write, though.

* Reading and parsing a file is going to be more load in most cases than doing the test itself. And getting another daemon to start/stop simultaneously, handle its own logging, etc. is itself a pain. This was why I wrote it this way rather than post-processing the JSON files.

* That said, we're not going to be able to anticipate all use cases, and I can see the advantage of loading results from a file (or just detecting the presence of a file per test, which would be far quicker). But I can just add these as two more test types! That way we have the best of both worlds. Happy to do so; as I say, it's about 5 minutes' work. Combine this with the ease of adding tests yourself, and I think we have everything covered.

WDYT?

abh commented 9 years ago

I definitely see your argument, but I think your message also implies some agreement with mine (you could quickly think of a couple more tests). I’m more concerned about the slippery slope of GeoDNS suddenly having more code and complexity for doing health checks than for … DNS stuff. Even your SSL example could quickly get requests for accepting invalid certificates, specifying a client cert, etc etc.

(As an aside I actually do use GeoDNS for DNS servers (“g.ntpns.org”) — I know that’s completely crazy and I don’t think GeoDNS should support this natively.)

I like the idea of having the external mechanism/API, and maybe I can be convinced that keeping a couple of simple checks in, too, is reasonable for ease of use in simple cases. However, I really think the external mechanism will be cleaner and easier to maintain for configurations past the simplest ones. Maybe just the TCP one? Really, though, I think it’d be much better to extract the code you wrote into a separate daemon.

Anyway, for the external health data:

As a starting point maybe have the label have something like

"test": { "label": "foo" }

or even just

"healthlabel": "foo"

and then the health check file(s?) could be just something like

foo 1.2.3.4

to specify that 1.2.3.4 is an invalid answer.
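
A loader for a file in that shape could look roughly like the Go sketch below. It assumes one `<healthlabel> <ip>` pair per line, exactly as in the example above. The function names `parseHealthFile` and `filterAnswers` are hypothetical, and nothing here is taken from GeoDNS itself.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseHealthFile reads lines of the form "<healthlabel> <ip>" and returns,
// per label, the set of addresses currently marked as invalid answers.
func parseHealthFile(data string) map[string]map[string]bool {
	bad := make(map[string]map[string]bool)
	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue // skip blank or malformed lines
		}
		label, ip := fields[0], fields[1]
		if bad[label] == nil {
			bad[label] = make(map[string]bool)
		}
		bad[label][ip] = true
	}
	return bad
}

// filterAnswers drops the addresses that the health data marks invalid
// for the given health label and keeps the rest in order.
func filterAnswers(label string, answers []string, bad map[string]map[string]bool) []string {
	var out []string
	for _, ip := range answers {
		if !bad[label][ip] {
			out = append(out, ip)
		}
	}
	return out
}

func main() {
	bad := parseHealthFile("foo 1.2.3.4\n")
	// prints [5.6.7.8]
	fmt.Println(filterAnswers("foo", []string{"1.2.3.4", "5.6.7.8"}, bad))
}
```

Because the file lists only invalid answers, a label with no entries keeps all of its records, so a missing or empty file degrades to "everything is healthy".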

abligh commented 9 years ago

I have reworked this to support:

* Existing health tests (tcp / ntp)

* exec health test - run a command with an appropriate parameter

* file health test - read a JSON file (as described by you)

* nodeping health test - use Nodeping API

* pingdom health test - use Pingdom API

abligh commented 9 years ago

Rebased

skyred commented 9 years ago

I am new to GeoDNS, so I couldn't fully follow the discussion above. But I'd like to bring up a use case for your consideration:

I want to use the Measurement API from the RIPE Atlas project to dynamically change GeoDNS records/weights. Here is one example: https://atlas.ripe.net/measurements/2400706/#!seismograph pings the Linode Tokyo datacenter every 5 minutes from 5 cities in China. I'd like GeoDNS to dynamically change the weights of multiple A records for a given subdomain, based on packet loss rates.

abligh commented 9 years ago

@skyred if you want to add/remove A records according to packet loss, you can use the test framework above. If you want to change their weights (rather than just add/remove them), I can't think of a better way right now than rewriting the zone files. Write them to a .tmp file, then rename them over the original to get an atomic change and GeoDNS will pick them up. Allowing a healthcheck to specify a weight is an interesting idea though.

abligh commented 8 years ago

@abh quick ping - any movement on this one?

abligh commented 8 years ago

rebased

abligh commented 8 years ago

Rebased onto a branch including cc09d9d to avoid travis errors.

abh / geodns

Add healthtest feature to dynamically add/remove DNS entries #73