vruello opened this issue 2 weeks ago
Prometheus (I'm not sure about the rust client) also supports a push-based model via remote_write.
The Prometheus documentation states that "The remote write protocol is not intended for use by applications to push metrics to Prometheus remote-write-compatible receivers. It is intended that a Prometheus remote-write-compatible sender scrapes instrumented applications or exporters and sends remote write messages to a server." (https://prometheus.io/docs/specs/remote_write_spec/#background).
An application can push metrics to the Prometheus Pushgateway (https://prometheus.io/docs/instrumenting/pushing/). However, the OpenWEC server does not seem to fit the cases where the Pushgateway should be used: https://prometheus.io/docs/practices/pushing/#should-i-be-using-the-pushgateway.
After a little more thought, I think metrics should be split into two categories:
Metrics intended for developers should probably be provided using tracing (and perhaps tracing-opentelemetry). This would require some work to transition from logging/log4rs, but it would give us a lot of powerful tools and features to analyze the behavior of the server in depth.
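To give an idea, here is a minimal sketch of what instrumenting with tracing could look like, assuming the tracing and tracing-subscriber crates; the function, fields and values are made up, not OpenWEC's actual code:

```rust
// Minimal sketch (illustrative names): replacing logging/log4rs with tracing.
use tracing::{info, instrument};

// The span records the subscription and the batch size; the batch itself is skipped.
#[instrument(skip(batch), fields(batch_size = batch.len()))]
fn handle_event_batch(subscription: &str, batch: &[String]) {
    info!(subscription, "processing event batch");
}

fn main() {
    // A basic fmt subscriber; tracing-opentelemetry could be layered on top later.
    tracing_subscriber::fmt().with_target(false).init();
    handle_event_batch("my-subscription", &["<event/>".to_string()]);
}
```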
I have two working prototypes:
1. prometheus_client. I am not very happy with it. A lot of boilerplate code is needed to run the HTTP server that exposes the metrics. Also, metric "objects" (counters, histograms, ...) have to be passed into each function call, which is annoying.
2. metrics and metrics-exporter-prometheus. I really like the idea that metrics are independent of the exporter (Prometheus here). It is also very easy to use (macros, like the log crate). It may be a bit slower than prometheus_client, but I don't think the difference is significant.

I will test the second prototype in a real environment and see whether it affects performance.
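For illustration (this is not the actual prototype), here is roughly what call sites look like with the metrics facade, assuming a recent version of metrics and metrics-exporter-prometheus; the metric and label names are made up:

```rust
// Rough sketch: with the `metrics` facade, call sites only need macros
// (like the `log` crate); no registry object has to be threaded through.
use metrics::{counter, histogram};
use metrics_exporter_prometheus::PrometheusBuilder;

fn handle_request(subscription: &str, duration_secs: f64) {
    // Metric and label names are illustrative, not OpenWEC's final ones.
    counter!("openwec_event_received_total", "subscription" => subscription.to_string())
        .increment(1);
    histogram!("http_request_duration_seconds").record(duration_secs);
}

fn main() {
    // The exporter is installed once and exposes the metrics over HTTP;
    // the rest of the code stays exporter-agnostic.
    PrometheusBuilder::new()
        .install()
        .expect("failed to install Prometheus exporter");
    handle_request("my-subscription", 0.0008);
}
```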
After a few weeks of testing "in production", I am quite confident that the production of Prometheus metrics using metrics-rs and metrics-exporter-prometheus has a negligible impact on performance.
The default buckets used for the "request_duration" histogram are not really suitable for OpenWEC: in our production environment, openwecd answers 50% of HTTP requests in less than 1 ms. I think it's fine to keep the "standard" buckets (0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10) as the default, but it is probably better to override them in the configuration if you need an accurate view.
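As a hedged sketch (not the actual configuration mechanism), overriding the buckets for that histogram with metrics-exporter-prometheus could look roughly like this; the metric name and bucket values are illustrative:

```rust
// Sketch only: finer buckets for a sub-millisecond-heavy workload.
use metrics_exporter_prometheus::{Matcher, PrometheusBuilder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let buckets = [
        0.0001, 0.00025, 0.0005, 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0,
    ];
    PrometheusBuilder::new()
        // Only the request duration histogram gets the finer buckets.
        .set_buckets_for_metric(Matcher::Full("http_request_duration_seconds".into()), &buckets)?
        .install()?;
    Ok(())
}
```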
metrics-rs looks pretty promising, helping manage the registry of metric objects without being too intrusive. It's also protocol-agnostic, which may come in handy later.
Metrics intended for users: the number of events received per second, total and per subscription.
As far as I can see, openwec_event_received_total will be able to handle everything related to the input side of events; EPS will just be a rate() call on the Prometheus side. The only important thing for this metric is to create small enough partitions to allow informative queries on it later. High cardinality in partitions (labels) should be avoided, though (especially if they are often active within a reasonable time range), because that would dramatically increase the stored data on the Prometheus side and would also make queries slower.
All things considered, I think adding subscription to the labels will never be considered high cardinality and is crucial for providing helpful metrics. The next very important partition would be computer/machine/source (machine is probably the right name in the OpenWEC context), because even in the most basic use cases, one wants to know where the events are coming from, how many there are, which client is the loudest in a given time interval, or which group of clients is generating the most events.
But depending on the use case, machine may become high cardinality, so I think that label should be configurable: on by default, but possible to disable in extremely large, high-scale environments.
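A rough sketch, with made-up names and a hypothetical config flag, of how the machine label could be made optional while keeping subscription always present:

```rust
// Sketch only: a configurable `machine` label to keep cardinality under control.
use metrics::counter;

struct MetricsSettings {
    machine_label_enabled: bool, // hypothetical config flag, on by default
}

fn count_received_event(settings: &MetricsSettings, subscription: &str, machine: &str) {
    if settings.machine_label_enabled {
        counter!(
            "openwec_event_received_total",
            "subscription" => subscription.to_string(),
            "machine" => machine.to_string()
        )
        .increment(1);
    } else {
        // Without the machine label, the series count only grows with subscriptions.
        counter!(
            "openwec_event_received_total",
            "subscription" => subscription.to_string()
        )
        .increment(1);
    }
}

fn main() {
    let settings = MetricsSettings { machine_label_enabled: true };
    count_received_event(&settings, "my-subscription", "host01.example.org");
}
```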
(Just a random nitpick: I usually see unitless metric names in the plural: openwec_events_received_total, or even more popular, openwec_received_events_total. That way, they look more consistent next to metrics that have units, because https://prometheus.io/docs/practices/naming/ says: "should have a suffix describing the unit, in plural form".)
Observability metrics need to be produced and exposed by the OpenWEC server.
Which metrics?
I think the following metrics would be interesting to have:
From a developer's point of view, it would also be interesting to optionally add more timing metrics to measure the time spent in different parts of the code. For example, when we receive a batch of events, it would be interesting to know how much time we spend decrypting, decompressing, parsing XML, formatting events, writing formatted events to each output, generating the response and encrypting the response.
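To make this concrete, here is a small sketch (names are illustrative, not OpenWEC's code) of how per-stage timings could be recorded into a single histogram with a stage label, using the metrics facade:

```rust
// Sketch: timing individual processing stages of a batch.
use metrics::histogram;
use std::time::Instant;

// Runs a closure and records how long it took, labelled with the stage name.
fn timed_stage<T>(stage: &'static str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let result = f();
    histogram!("openwec_batch_stage_duration_seconds", "stage" => stage)
        .record(start.elapsed().as_secs_f64());
    result
}

fn main() {
    // For example: decryption, decompression and XML parsing could each be wrapped.
    let _decrypted = timed_stage("decrypt", || vec![0u8; 16]);
    let _decompressed = timed_stage("decompress", || vec![0u8; 64]);
    let _parsed = timed_stage("parse_xml", || "<event/>".to_string());
}
```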
Feel free to suggest other metrics!
Which protocol/format?
There are multiple ways to expose/transmit metrics. After a brief survey of the state of the art, I think we need to choose between:
Both have pros and cons:
Which library?
I'm currently working on a prototype with prometheus_client where the OpenWEC server would expose an HTTP server dedicated to metrics (different listening addr/port).
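For reference, a minimal sketch (not the actual prototype) of what the prometheus_client side involves: a registry, explicit registration of each metric, and encoding in the text exposition format; the dedicated HTTP listener itself is omitted here:

```rust
// Minimal sketch with illustrative names: registry setup and text encoding.
use prometheus_client::encoding::text::encode;
use prometheus_client::metrics::counter::Counter;
use prometheus_client::registry::Registry;

fn main() {
    let mut registry = Registry::default();
    let received_events: Counter = Counter::default();
    // prometheus_client appends the "_total" suffix for counters in the output.
    registry.register(
        "openwec_event_received",
        "Number of events received by the server",
        received_events.clone(),
    );

    received_events.inc();

    // This encoded body is what the dedicated metrics HTTP listener
    // (separate addr/port) would return on GET /metrics.
    let mut body = String::new();
    encode(&mut body, &registry).expect("failed to encode metrics");
    println!("{body}");
}
```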