SuperQ / chrony_exporter

Exporter for Chrony NTP
Apache License 2.0

Expose server reachability #75

Closed hhoffstaette closed 2 months ago

hhoffstaette commented 2 months ago

Based on a recent mailing list thread I'd like to propose the exposure of an additional metric.

The original chrony client shows the reachability of upstream servers and this can be used to detect changes in network topology without having to rely on time/clock drift for failure detection.

Having this field exposed would make it easy to create e.g. an alertmanager rule.

SuperQ commented 2 months ago

From the chrony docs

This shows the source’s reachability register printed as an octal number. The register has 8 bits and is updated on every received or missed packet from the source. A value of 377 indicates that a valid reply was received for all from the last eight transmissions.

The facebook/time library returns this value as a uint16.

While there are some interesting things we could do with the bits, the most useful thing that comes to mind would be to compute the fraction of bits in the value that are set to 1. So if all probes fail, the metric is 0.0; if all probes pass, it is 1.0.
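A minimal sketch of that ratio, assuming the register arrives as a uint16 whose low 8 bits carry the reachability value. The function name `reachRatio` is hypothetical and not part of the exporter:

```go
package main

import (
	"fmt"
	"math/bits"
)

// reachRatio returns the fraction of the last eight polls that received a
// valid reply. Only the low 8 bits are counted, because the chrony register
// is 8 bits wide even though the value is delivered as a uint16.
// (Hypothetical helper, for illustration only.)
func reachRatio(reach uint16) float64 {
	return float64(bits.OnesCount8(uint8(reach))) / 8.0
}

func main() {
	// 0o377 = all eight replies received, 0o17 = four of eight, 0 = none.
	for _, r := range []uint16{0o377, 0o17, 0} {
		fmt.Printf("reach=%03o ratio=%.3f\n", r, reachRatio(r))
	}
}
```

Counting bits this way ignores which polls failed, so one recent miss and one old miss give the same value.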

Another option would be to expose only the most recent bit. Since Prometheus typically polls faster than NTP packets are sent, we could represent the "last reach success" as a simple bool.

The question is, how is that register updated, shift left? shift right?
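For illustration, here is how the "last reach success" bool could be extracted if the register shifts left, so that the newest result sits in the least significant bit. That direction is an assumption at this point in the thread and would need to be checked against chrony itself; `lastReachSuccess` is a hypothetical name:

```go
package main

import "fmt"

// lastReachSuccess reports whether the most recent poll got a valid reply.
// ASSUMPTION: the register is shifted left on every update, so the newest
// result is bit 0. If it shifted right instead, bit 7 (0x80) would be the
// one to test.
func lastReachSuccess(reach uint16) bool {
	return reach&1 == 1
}

func main() {
	// 0o377 = 11111111, 0o376 = 11111110 (newest poll missed, under the
	// left-shift assumption).
	fmt.Println(lastReachSuccess(0o377), lastReachSuccess(0o376))
}
```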

The next useful option would be to expose the bits directly as a state set. While this would provide the full bit detail, it's a bit high cardinality.
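As a rough sketch of what the state-set variant would involve, the register can be expanded into eight booleans, one per bit position. The helper `reachBits` is hypothetical; how each position maps to a poll's age depends on the shift direction, which this sketch does not assume:

```go
package main

import "fmt"

// reachBits expands the low 8 bits of the register into one bool per bit
// position (index 0 = least significant bit). Each entry would become one
// series in a state set, which is where the cardinality comes from:
// eight series per source. (Hypothetical helper, for illustration only.)
func reachBits(reach uint16) [8]bool {
	var out [8]bool
	for i := range out {
		out[i] = reach&(1<<uint(i)) != 0
	}
	return out
}

func main() {
	// 0o5 = 00000101: bit positions 0 and 2 are set.
	fmt.Println(reachBits(0o5))
}
```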

As for the easy option, exposing the raw byte directly as a value, this seems less useful for monitoring, as you would have to interpret the bits in PromQL for the alert to be useful. I would say this is better mapped in the exporter's code.

hhoffstaette commented 2 months ago

The reachability is available in the SourceData. However, I now wonder whether this is more useful than simply alerting on chrony_sources_state_info, which already exposes the source state in a human-readable way.

Edit: a simple chrony_sources_state_info{source_state != "sync"} (for say >=10m) alert would have been enough to prevent the problem reported on the mailing list.

hhoffstaette commented 2 months ago

Flattening the value to a binary last_reach_success is probably too fragile against network hiccups - packets can and do get lost even on in-house networks (resyncs, reboots and whatnot).

Hmm.. maybe this wasn't such a useful idea at all :sweat_smile:

SuperQ commented 2 months ago

Chrony's default minpoll is 6, which is 2^6 seconds (64 seconds) with a maxpoll of 10 (2^10 = 1024 seconds). So even if you have a scrape interval of 60s, you'll always catch the last reach bit.

NTP is a very low packet count protocol.

SuperQ commented 2 months ago

Did some local testing.