monitoring: expose Prometheus-friendly metrics

This is possibly a followup on https://github.com/bus1/dbus-broker/pull/220.

It would be nice to have some Prometheus-compatible ways to query internal metrics from dbus-broker.

For reference, Prometheus has its own exposition format, which is basically a well-known data structure serialized as the plaintext response to an HTTP GET. While the transport and encoding are likely not useful here, the underlying data structure is:

a map of metric_key -> metric_value, where:
- metric_value is a f64
- metric_key is metric name + map(label_key, label_value), where:
  - all components (metric_key, label_key, label_value) are strings
  - the labels map is optional
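
To make the shape above concrete, here is a minimal sketch of that data structure in Python, together with a renderer for the plaintext form. The metric and label names are made up for illustration; nothing here is an existing dbus-broker interface.

```python
from typing import Dict, FrozenSet, Tuple

# metric_key = metric name + optional set of (label_key, label_value) pairs.
Labels = FrozenSet[Tuple[str, str]]
MetricKey = Tuple[str, Labels]


def metric_key(name: str, **labels: str) -> MetricKey:
    """Build a hashable key from a metric name and optional labels."""
    return (name, frozenset(labels.items()))


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render(metrics: Dict[MetricKey, float]) -> str:
    """Render the map as one 'name{labels} value' line per metric."""
    lines = []
    ordered = sorted(metrics.items(), key=lambda item: (item[0][0], sorted(item[0][1])))
    for (name, labels), value in ordered:
        if labels:
            pairs = ",".join(
                f'{k}="{escape_label_value(v)}"' for k, v in sorted(labels)
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "".join(line + "\n" for line in lines)


# Hypothetical metric names and values, for illustration only.
SAMPLE = {
    metric_key("dbus_broker_start_time_seconds"): 1600000000.0,
    metric_key("dbus_broker_connections", state="active"): 12.0,
}

if __name__ == "__main__":
    print(render(SAMPLE), end="")
```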

I'm essentially asking for an interface similar to org.freedesktop.DBus.Debug.Stats.GetStats(), but returning a data structure equivalent to the one above or, even better, the Prometheus textual format directly.

My initial MVP for metrics to query here would be:

- a gauge for the start timestamp of the process
- a gauge for the number of active connections

For reference in case this looks very fuzzy, I have an unrelated service implementing something similar (minus the dbus part) and the result can be (temporarily) observed here.

Can you elaborate on the use-case? Which information would you want to be returned? Why do you want this information to be returned? Who processes this information? And especially who interprets it?

The use case would be monitoring dbus-broker on a fleet of server nodes that are part of a large cluster, where there is a centralized monitoring solution.

I am not looking for a specific piece of information at this point (though the MVP above has two basic examples). I'd want a stable way to monitor dbus-broker, while leaving the developers (you) free to expose whatever they think is useful to track from a stability/performance/capacity point of view.

I want this metrics information to be exposed/returned, in order to do whitebox monitoring of dbus-broker. That is, asking dbus-broker directly about its relevant internal state, and tracking it over time. This interface would be periodically polled (e.g. every minute) and the result recorded externally.

This information is processed by the monitoring system (e.g. Prometheus) and is recorded in some time-series databases. It can be used for proactive alerting, performance tracking, post-mortem analysis, capacity planning, dashboarding, and more.

It is aggregated, interpreted and queried by the monitoring solution itself. That is, as long as there is a standard way (see the data structure above) to retrieve those metrics, the rest of the logic is decoupled from this. Any monitoring solution can then be used to consume it.

As a concrete example using the unrelated service above, these metrics can be used for cluster dashboarding and live performance/status querying.

I looked into Prometheus a bit more. In general, I like the concept of aggregating metrics. I am also fine with keeping close to Prometheus semantics, as it seems to be a quite established utility in that field.

I am, though, a bit worried about using the Prometheus data-format in dbus-broker. I'd be fine adding support for it to the launcher, or any other tools. However, the core broker implementation is currently a pure dbus implementation, that has no other external access whatsoever. Furthermore, the experiences with external formats integrated into D-Bus messages have not been very pleasant in the past (e.g., the mess with XML introspection data). Lastly, I am not very fond of the very lax parsing requirements of the text-based Prometheus format (why does it parse comments when they begin with magic keywords? Will more keywords be added in the future? Why does it use common-suffixes on user-selected identifiers (like _sum)? This is all quite shady and worrying for forward-compatibility).

Anyway, let's put my personal opinion on that aside. The fact is, dbus-broker has no external access other than the bus it manages and the control connection to its launcher. Both are D-Bus marshaled transports. So the natural extension would be to add more native interfaces, like you suggested with the Stats interface. This would allow easy access to metrics, integrate nicely with the already existing access control (if required), and be straightforward to implement and maintain.

With this in mind, it would be rather natural to return information natively marshaled as D-Bus messages. A simple a{sv} as the result object would be in line with the other interfaces. This would mean there needs to be a transformation from this format into the Prometheus text format at the call site. However, the caller already needs to parse the D-Bus message anyway, so I think this would be acceptable. What do you think?

The best approach would be to agree with dbus-daemon developers on an interface. But they are very reluctant to accept interface definitions without an implementation in the reference-implementation (which is a legitimate position, I think). I am not very fond of implementing a metrics interface for both dbus-broker and dbus-daemon, so I would rather propose a dbus-broker private interface with the future option of aliasing it with an official org.freedesktop.DBus name. Anyway, how about this:

interface org.bus1.DBus.Metrics {
    Current() -> a{sv}
}

A very simple interface that allows querying the current metrics. It could then be easily extended with more data in the future. The labels can be easily converted into Prometheus labels.
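
As a rough sketch of the call-site transformation this implies, the snippet below takes the already-unmarshaled a{sv} result as a plain Python dict and maps it to Prometheus-style names with f64 values. The dictionary keys, the `dbus_broker_` prefix, and the choice to skip non-numeric variants are all assumptions made for illustration; the thread does not define them.

```python
import re
from typing import Any, Dict

# Characters outside the Prometheus metric-name alphabet get replaced.
_INVALID_CHARS = re.compile(r"[^a-zA-Z0-9_:]")
# Zero-width match before every uppercase letter except at the start.
_CAMEL_BOUNDARY = re.compile(r"(?<!^)(?=[A-Z])")


def to_metric_name(key: str, prefix: str = "dbus_broker_") -> str:
    """Turn a dictionary key such as 'ActiveConnections' into a metric name.

    The prefix is a hypothetical naming choice, not something dbus-broker defines.
    """
    snake = _CAMEL_BOUNDARY.sub("_", key).lower()
    return prefix + _INVALID_CHARS.sub("_", snake)


def convert(result: Dict[str, Any]) -> Dict[str, float]:
    """Keep only numeric variants and coerce them to float (f64)."""
    metrics: Dict[str, float] = {}
    for key, value in result.items():
        # bool is a subclass of int in Python; skip it explicitly,
        # along with strings and any other non-numeric variant.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        metrics[to_metric_name(key)] = float(value)
    return metrics


# Hypothetical reply contents, standing in for an unmarshaled a{sv}.
EXAMPLE_REPLY = {
    "ActiveConnections": 12,
    "AuthenticatingConnections": 1,
    "Version": "29",
    "Ready": True,
}

if __name__ == "__main__":
    for name, value in sorted(convert(EXAMPLE_REPLY).items()):
        print(name, value)
```

Skipping non-numeric entries keeps the converter tolerant of future additions to the dictionary, at the cost of silently dropping anything that is not a number.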

I am wondering whether to include information about the service that is managed externally. For instance, the start time sounds like something to query from the service-file, rather than from the running service. I mean, this is information that is not under the control of dbus-broker, but only under the control of its parent.

A simple set of metrics to start with would be: the total number of client IDs allocated, the number of currently active connections, and the number of connections currently authenticating.

For everything we then add on top, we would have to discuss whether we can get the data without calculating anything at runtime.

Anyway, comments welcome! I am open to suggestions!

@dvdhrm ack, I see your point about being wary of locking yourself into an externally defined exposition format; it's a legitimate one. Regarding metric names, there are guidelines which are not mandatory but help avoid a lot of common pitfalls.

Your suggested approach means there needs to be a smarter middleman/proxy as an exporting point for Prometheus; that's a totally legit pattern. I'd be fine with that.
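
For illustration, such a middleman could look roughly like the sketch below: a tiny HTTP endpoint that, on each scrape, calls a fetch function and serves the result as plaintext. The fetch function here is a stub returning made-up values; in a real exporter it would issue the D-Bus call instead. The metric name, the `/metrics` path, and the port handling are assumptions.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Type


def fetch_metrics_stub() -> Dict[str, float]:
    """Placeholder for the D-Bus query; the name and value are made up."""
    return {"dbus_broker_connections_active": 12.0}


def make_handler(fetch: Callable[[], Dict[str, float]]) -> Type[BaseHTTPRequestHandler]:
    """Build a request handler that serves fetch() results on /metrics."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if self.path != "/metrics":
                self.send_error(404)
                return
            # One 'name value' line per metric, fetched fresh on every scrape.
            text = "".join(f"{n} {v}\n" for n, v in sorted(fetch().items()))
            body = text.encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt: str, *args: object) -> None:
            pass  # keep the sketch quiet; a real exporter would log

    return Handler


def serve(fetch: Callable[[], Dict[str, float]],
          host: str = "127.0.0.1", port: int = 0) -> HTTPServer:
    """Start the exporter in a background thread; port 0 picks a free port."""
    server = HTTPServer((host, port), make_handler(fetch))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `serve(fetch_metrics_stub)` returns the server object; its `server_address` gives the bound port, and `shutdown()` stops it.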

Regarding the D-Bus signature, a{sv} looks fine as a minimal approach. I think you can even go fully statically typed and trim that down to an a{sd} (i.e. explicit f64 values).

If you want to make the s part a bit richer / less stringly-typed, another approach is to further break labels apart into label pairs (this is how the Prometheus protobuf encoding represents them). I think it would result in an overall Current signature like `a{a{ss}d}`. Up to you how much complexity you want to have there; a single string is fine too as the lowest common denominator.

Regarding service info / start timestamp, you have a good point that this usually belongs to the service manager. However, I think duplicating it here too won't hurt and would make metrics analysis simpler. One difference, though, is that here it is pretty much an O(1) call, while getting the same through a systemd exporter requires walking the service hierarchy and filtering through properties.

monitoring: expose Prometheus-friendly metrics #228