Add observability metrics to OpenWEC server

vruello commented 2 weeks ago

Observability metrics need to be produced and exposed by the OpenWEC server.

Which metrics?

I think the following metrics would be interesting to have:

The number of HTTP requests received per second, total and per action (enumerate, events, heartbeat)
The response time of each HTTP requests per second, total and per action (enumerate, events, heartbeat)
The number of events received per second, total and per subscription
The number of events that could not be handled (because an output failed) per second (or bigger), total and per subscription. Would be helpful to detect if there is a problem with an output (for example if the file system is full).
The total number of machines seen per subscription (already covered by "openwec stats")
The number of active machines (received an event "recently") per subscription (already covered by "openwec stats")
The number of alive machines (received an heartbeat "recently") per subscription (already covered by "openwec stats")
The number of dead machines (didn't receive anything "recently") per subscription (already covered by "openwec stats")

From a developer's point of view, it would also be interesting to optionally add more timing metrics, for example to measure the amount of time spent in parts of the code. For example, when we receive a batch of events, it would be interesting to know how much time we spend decrypting, decompressing, parsing xml, formatting events, writing formatted events to each output, generating response and encrypting response.

Feel free to suggest other metrics!

Which protocol/format?

There are multiple ways to expose/transmit metrics. After a brief state of the art, I think we need to choose between:

OpenMetrics (https://github.com/prometheus/OpenMetrics) (pull-based), the protocol and format used by Prometheus, which seems to be the most "up to date" solution.
statsd (https://github.com/b/statsd_spec) (push-based)

Both have pros and cons:

OpenMetrics:
- Pros:
- No network overhead
- (maybe) the new standard?
- Integrates well in Kubernetes environments
- Cons:
  - Need to expose (another) HTTP server
  - Limited features (specially for timing measures)
statsd:
- Pros:
- Advanced features (because everything is calculated on the statsd server)
- Cons:
- (maybe) a lot of monitoring traffic?
- (maybe) impact on performances?

Which library?

statsd: Cadence
OpenMetrics/Prometheus: prometheus_client
both: metrics-rs

I'm currently working on a prototype with prometheus_client where the OpenWEC server would expose a HTTP server dedicated to metrics (different listening addr/port).

tarokkk commented 2 weeks ago

Prometheus (I'm not sure about the rust client) also supports a push-based model via remote_write.

vruello commented 2 weeks ago

Prometheus (I'm not sure about the rust client) also supports a push-based model via remote_write.

The Prometheus documentation states that "The remote write protocol is not intended for use by applications to push metrics to Prometheus remote-write-compatible receivers. It is intended that a Prometheus remote-write-compatible sender scrapes instrumented applications or exporters and sends remote write messages to a server." (https://prometheus.io/docs/specs/remote_write_spec/#background).

An application can push metrics to Prometheus Pushgateway (https://prometheus.io/docs/instrumenting/pushing/). However, the OpenWEC server does not seem to fit in the cases where the pushgateway should be used: https://prometheus.io/docs/practices/pushing/#should-i-be-using-the-pushgateway.

vruello commented 2 weeks ago

After a little more thought, I think metrics should be split in 2 categories:

metrics intended for users
metrics intended for developers

Metrics intended for developers should probably be provided using tracing (and perhaps tracing-opentelemetry). This would require some work to transition from logging/log4rs but it would give us a lot of powerful tools and features to analyze the behavior of the server in depth.

vruello commented 2 weeks ago

I have two working prototypes:

One with prometheus_client. I am not very happy with it. A lot of boilerplate code is needed to run the HTTP server needed to expose the metrics. Also, metrics "objects" (counters, histograms, ...) have to be passed into each function call which is annoying.
One with metrics and metrics-exporter-prometheus. I really like the idea that metrics are independent of the exporter (prometheus here). It is also very easy to use (macros like log crate). Maybe it is a bit slower than prometheus_client, but I don't think that it is significant.

I will test the second prototype in a real environment and see if it affects performance or not.

vruello commented 7 hours ago

After a few weeks of testing "in production", I am quite confident that the production of prometheus metrics using metrics-rs and metrics-exporter-prometheus has a negligible impact on performance.

The default buckets used to store the "request_duration" histogram are not really suitable for openwec. In our production environment, openwecd answers 50% of http requests in less than 1ms... I think it's fine to keep the "standard" (0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10) as the default, but it is probably better to change them in the configuration if you need an accurate view.

MrAnno commented 3 hours ago

metrics-rs looks pretty promising, helping manage the registry of metric objects without being too intrusive. It's also protocol-agnostic, which may come in handy later.

metrics intended for users The number of events received per second, total and per subscription

As far as I can see, openwec_event_received_total will be able to handle everything related to the input side of events, EPS will be just a rate() call on the Prometheus side. The only important thing for this metric is to create small enough partitions in order to allow informative queries on it later. High cardinality in partitions (labels) should be avoided though (especially if they are often active within a reasonable time range), because that would dramatically increase the stored data on the Prometheus side, and would also make queries slower.

All things considered, I think adding subscription to the labels will never be considered high cardinality and is crucial for providing helpful metrics. The next very important partition would be computer/machine/source (probably machine is the good name in the OpenWEC context), because even in the most basic use cases, one wants to know where the events are coming from, how many are there, which client is the loudest in a given time interval, or which group of clients is generating the most events. But depending on the use case, machine may become high cardinality, so I think that label should be configurable: on by default, but could be disabled in extremely big high-scale environments.

(Just random nitpick: I usually see unitless metric names in plural: openwec_events_received_total or even more popular, openwec_received_events_total. That way, they look more consistent next to metrics that have units, because https://prometheus.io/docs/practices/naming/ says: "should have a suffix describing the unit, in plural form".)

cea-sec / openwec