cea-sec / openwec

An implementation of a Windows Event Collector server running on GNU/Linux.
GNU General Public License v3.0
51 stars 21 forks source link

Add observability metrics to OpenWEC server #189

Open vruello opened 2 weeks ago

vruello commented 2 weeks ago

Observability metrics need to be produced and exposed by the OpenWEC server.

Which metrics?

I think the following metrics would be interesting to have:

From a developer's point of view, it would also be interesting to optionally add more timing metrics, for example to measure the amount of time spent in parts of the code. For example, when we receive a batch of events, it would be interesting to know how much time we spend decrypting, decompressing, parsing xml, formatting events, writing formatted events to each output, generating response and encrypting response.

Feel free to suggest other metrics!

Which protocol/format?

There are multiple ways to expose/transmit metrics. After a brief state of the art, I think we need to choose between:

Both have pros and cons:

Which library?

I'm currently working on a prototype with prometheus_client where the OpenWEC server would expose a HTTP server dedicated to metrics (different listening addr/port).

tarokkk commented 2 weeks ago

Prometheus (I'm not sure about the rust client) also supports a push-based model via remote_write.

vruello commented 2 weeks ago

Prometheus (I'm not sure about the rust client) also supports a push-based model via remote_write.

The Prometheus documentation states that "The remote write protocol is not intended for use by applications to push metrics to Prometheus remote-write-compatible receivers. It is intended that a Prometheus remote-write-compatible sender scrapes instrumented applications or exporters and sends remote write messages to a server." (https://prometheus.io/docs/specs/remote_write_spec/#background).

An application can push metrics to Prometheus Pushgateway (https://prometheus.io/docs/instrumenting/pushing/). However, the OpenWEC server does not seem to fit in the cases where the pushgateway should be used: https://prometheus.io/docs/practices/pushing/#should-i-be-using-the-pushgateway.

vruello commented 2 weeks ago

After a little more thought, I think metrics should be split in 2 categories:

Metrics intended for developers should probably be provided using tracing (and perhaps tracing-opentelemetry). This would require some work to transition from logging/log4rs but it would give us a lot of powerful tools and features to analyze the behavior of the server in depth.

vruello commented 2 weeks ago

I have two working prototypes:

I will test the second prototype in a real environment and see if it affects performance or not.

vruello commented 7 hours ago

After a few weeks of testing "in production", I am quite confident that the production of prometheus metrics using metrics-rs and metrics-exporter-prometheus has a negligible impact on performance.

The default buckets used to store the "request_duration" histogram are not really suitable for openwec. In our production environment, openwecd answers 50% of http requests in less than 1ms... I think it's fine to keep the "standard" (0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10) as the default, but it is probably better to change them in the configuration if you need an accurate view.

MrAnno commented 3 hours ago

metrics-rs looks pretty promising, helping manage the registry of metric objects without being too intrusive. It's also protocol-agnostic, which may come in handy later.

metrics intended for users The number of events received per second, total and per subscription

As far as I can see, openwec_event_received_total will be able to handle everything related to the input side of events, EPS will be just a rate() call on the Prometheus side. The only important thing for this metric is to create small enough partitions in order to allow informative queries on it later. High cardinality in partitions (labels) should be avoided though (especially if they are often active within a reasonable time range), because that would dramatically increase the stored data on the Prometheus side, and would also make queries slower.

All things considered, I think adding subscription to the labels will never be considered high cardinality and is crucial for providing helpful metrics. The next very important partition would be computer/machine/source (probably machine is the good name in the OpenWEC context), because even in the most basic use cases, one wants to know where the events are coming from, how many are there, which client is the loudest in a given time interval, or which group of clients is generating the most events. But depending on the use case, machine may become high cardinality, so I think that label should be configurable: on by default, but could be disabled in extremely big high-scale environments.

(Just random nitpick: I usually see unitless metric names in plural: openwec_events_received_total or even more popular, openwec_received_events_total. That way, they look more consistent next to metrics that have units, because https://prometheus.io/docs/practices/naming/ says: "should have a suffix describing the unit, in plural form".)