Observability metrics need to be produced and exposed by the OpenWEC server.
Which metrics?
I think the following metrics would be interesting to have:
The number of HTTP requests received per second, total and per action (enumerate, events, heartbeat)
The response time of HTTP requests, aggregated per second, total and per action (enumerate, events, heartbeat)
The number of events received per second, total and per subscription
The number of events that could not be handled (because an output failed) per second (or over a longer interval), total and per subscription. This would help detect a problem with an output (for example, a full file system).
The total number of machines seen per subscription (already covered by "openwec stats")
The number of active machines (received an event "recently") per subscription (already covered by "openwec stats")
The number of alive machines (received a heartbeat "recently") per subscription (already covered by "openwec stats")
The number of dead machines (didn't receive anything "recently") per subscription (already covered by "openwec stats")
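The active/alive/dead split above can be sketched as a small classification over last-seen timestamps. This is only an illustration, not how "openwec stats" is implemented: the `MachineState` enum, the `classify` function, and the `window` parameter (the meaning of "recently") are hypothetical names, and I assume the three states are mutually exclusive.

```rust
use std::time::{Duration, SystemTime};

/// State of a machine for one subscription (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum MachineState {
    /// An event was received within the window.
    Active,
    /// No recent event, but a heartbeat was received within the window.
    Alive,
    /// Nothing was received within the window.
    Dead,
}

/// Classifies a machine from its last-seen timestamps.
/// `window` is the assumed definition of "recently".
pub fn classify(
    now: SystemTime,
    last_event: Option<SystemTime>,
    last_heartbeat: Option<SystemTime>,
    window: Duration,
) -> MachineState {
    let recent = |seen: Option<SystemTime>| {
        seen.map_or(false, |t| match now.duration_since(t) {
            Ok(age) => age <= window,
            // Timestamp is in the future (clock skew): treat it as recent.
            Err(_) => true,
        })
    };
    if recent(last_event) {
        MachineState::Active
    } else if recent(last_heartbeat) {
        MachineState::Alive
    } else {
        MachineState::Dead
    }
}
```

Counting the machines in each state per subscription would then give the three gauges listed above.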
From a developer's point of view, it would also be interesting to optionally add finer-grained timing metrics that measure the time spent in specific parts of the code. For example, when we receive a batch of events, it would be useful to know how much time we spend decrypting, decompressing, parsing XML, formatting events, writing formatted events to each output, generating the response, and encrypting the response.
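A minimal sketch of such per-stage timing, using only the standard library. `StageTimings` and the stage names are hypothetical; a real implementation would more likely feed a histogram in whichever metrics library is chosen.

```rust
use std::collections::BTreeMap;
use std::time::{Duration, Instant};

/// Accumulates the time spent in each named stage (hypothetical helper).
#[derive(Default)]
pub struct StageTimings {
    totals: BTreeMap<&'static str, Duration>,
}

impl StageTimings {
    /// Runs `f`, adds its elapsed wall-clock time to `stage`,
    /// and returns whatever `f` returned.
    pub fn time<T>(&mut self, stage: &'static str, f: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = f();
        *self.totals.entry(stage).or_default() += start.elapsed();
        out
    }

    /// Total time recorded for `stage`, if it was ever timed.
    pub fn get(&self, stage: &str) -> Option<Duration> {
        self.totals.get(stage).copied()
    }

    /// Names of all recorded stages, in sorted order.
    pub fn stages(&self) -> Vec<&'static str> {
        self.totals.keys().copied().collect()
    }
}
```

Usage would look like `let doc = timings.time("parse_xml", || parse(&payload));`, where `parse` and `payload` stand in for the actual processing step. Since the timings are meant to be optional, the wrapper could sit behind a configuration flag or a Cargo feature.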
Feel free to suggest other metrics!
Which protocol/format?
There are multiple ways to expose or transmit metrics. After a brief survey of the state of the art, I think we need to choose between statsd and OpenMetrics/Prometheus. Both have pros and cons:
Advanced features (because everything is calculated on the statsd server)
Cons:
(maybe) a lot of monitoring traffic?
(maybe) impact on performance?
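To make the traffic concern concrete: with statsd, each counter increment or timing sample is pushed to the statsd server as a short text line, typically over UDP. The sketch below only formats such lines. The function names and metric names are made up for illustration, and in practice a client library such as Cadence would handle this.

```rust
/// Formats one statsd counter increment, for example `name:1|c`.
pub fn statsd_counter(name: &str, delta: u64) -> String {
    format!("{}:{}|c", name, delta)
}

/// Formats one statsd timing sample in milliseconds, for example `name:12|ms`.
pub fn statsd_timing(name: &str, millis: u64) -> String {
    format!("{}:{}|ms", name, millis)
}

/// Joins several metric lines into a single newline-separated payload,
/// so that one datagram carries several samples.
pub fn batch(lines: &[String]) -> String {
    lines.join("\n")
}
```

Sending one datagram per request or per event could add up on a busy server, so batching lines or sampling would probably be needed to keep the overhead low.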
Which library?
statsd: Cadence
OpenMetrics/Prometheus: prometheus_client
both: metrics-rs
I'm currently working on a prototype with prometheus_client in which the OpenWEC server exposes an HTTP server dedicated to metrics (on a separate listening address/port).
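For reference, here is a rough sketch of what such a metrics endpoint would return, written with the standard library only rather than with prometheus_client. The metric name `openwec_http_requests_total`, the `action` label, and the `RequestCounters` type are assumptions for illustration, not the prototype's actual names.

```rust
use std::collections::BTreeMap;
use std::fmt::Write;

/// HTTP request counters keyed by action (hypothetical type).
#[derive(Default)]
pub struct RequestCounters {
    by_action: BTreeMap<String, u64>,
}

impl RequestCounters {
    /// Increments the counter for one action (enumerate, events, heartbeat).
    pub fn inc(&mut self, action: &str) {
        *self.by_action.entry(action.to_string()).or_insert(0) += 1;
    }

    /// Renders the counters in the Prometheus text exposition format.
    /// Assumes action names need no escaping (no quotes, backslashes, or newlines).
    pub fn render(&self) -> String {
        let mut out = String::new();
        out.push_str("# HELP openwec_http_requests_total HTTP requests received.\n");
        out.push_str("# TYPE openwec_http_requests_total counter\n");
        for (action, count) in &self.by_action {
            // Writing to a String cannot fail, so the result is ignored.
            let _ = writeln!(
                out,
                "openwec_http_requests_total{{action=\"{}\"}} {}",
                action, count
            );
        }
        out
    }
}
```

The dedicated listener would serve this text on each scrape. The "per second" rates would then be derived on the Prometheus side from the monotonically increasing counters, so the server itself only has to increment them.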