braintree / litmus_paper

Backend health tester for HA Services
MIT License

Add internet health metric #14

Closed lollipopman closed 7 years ago

lollipopman commented 7 years ago

This metric provides a simple heuristic of how healthy an ISP connection appears to be. You provide it a list of hosts and ports:

LitmusPaper::Metric::InternetHealth.new([ "cloud.google.com:443", "azure.microsoft.com:443", "aws.amazon.com:443", ])

The check then performs a TCP connect to each host and port, and the metric returns a number between 0 and 100 indicating the percentage of hosts that are reachable.
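For readers skimming the thread, the idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names `reachable?` and `internet_health` and the 5-second default timeout are assumptions made for the example.

```ruby
require "socket"
require "timeout"

# Hypothetical helper (not the PR's code): true if a TCP connection to
# host:port succeeds within timeout_seconds, false on any failure.
def reachable?(host, port, timeout_seconds = 5)
  Timeout.timeout(timeout_seconds) do
    TCPSocket.new(host, port).close
  end
  true
rescue Timeout::Error, SocketError, SystemCallError
  # Timeout, DNS failure, connection refused, unreachable network, etc.
  false
end

# Hypothetical helper: percentage (0..100) of "host:port" targets that
# accept a TCP connection. An empty list is treated as 0 here.
def internet_health(targets, timeout_seconds = 5)
  return 0 if targets.empty?

  reachable_count = targets.count do |target|
    host, port = target.split(":")
    reachable?(host, Integer(port), timeout_seconds)
  end
  (reachable_count * 100.0 / targets.size).round
end
```

With the three hosts above, two reachable hosts out of three would yield 67, so a single provider outage lowers the score rather than failing the check outright.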

lollipopman commented 7 years ago

@ssgelm @mvallaly @dkuntz2 would love any feedback on this request

zdzolton commented 7 years ago

You might want to try a number of regional DC domains, for each cloud provider... Especially since aws.amazon.com routes to AWS's us-east-1 DC, which has their worst availability, you might generate more noise than desired.

zdzolton commented 7 years ago

I do see that it reports a health score proportionate to the number of hosts it can reach, which should provide a better signal than an all-or-nothing check.

lollipopman commented 7 years ago

@zdzolton I agree that a more geographically dispersed sample would be better

lollipopman commented 7 years ago

@dpirotte made the observation that Timeout in Ruby 1.9 is known to have broken corner cases. There are a variety of workarounds, such as http://stackoverflow.com/a/21014439/1236063; however, the existing code has proven successful in litmus paper's use case, as shown by our many years of use in the TCP dependency. So rather than blindly incorporating a Stack Overflow patch, I would leave it as is and switch to the TCP socket timeout available in Ruby versions 2.0 and greater when we deprecate support for Ruby 1.9.
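As a rough sketch of what that later switch could look like, assuming Ruby 2.0 or newer: `Socket.tcp` accepts a `connect_timeout` option, so the connect no longer needs to be wrapped in `Timeout.timeout`. The helper name `tcp_reachable?` and the 5-second default are hypothetical and not part of litmus paper.

```ruby
require "socket"

# Hypothetical helper: TCP connect bounded by Socket.tcp's connect_timeout
# option (Ruby 2.0+) instead of the Timeout module.
def tcp_reachable?(host, port, timeout_seconds = 5)
  # With a block, Socket.tcp closes the socket when the block returns
  # and returns the block's value.
  Socket.tcp(host, port, connect_timeout: timeout_seconds) { |_socket| true }
rescue SystemCallError, SocketError, IOError
  # Errno::ETIMEDOUT, Errno::ECONNREFUSED, DNS failures, etc.
  false
end
```

Note that `connect_timeout` bounds the connect step; name resolution may not be covered by it, so a slow DNS lookup could still take longer than the configured value.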