KumoCorp / kumomta

The first Open-Source high-performance MTA developed from the ground-up for high-volume email sending environments.
https://kumomta.com
Apache License 2.0

Add Minimum Queue Size Threshold to Metrics API #284

Open MHillyer opened 1 month ago

MHillyer commented 1 month ago

As a MailOps administrator I use the metrics API to monitor my servers, looking to see if any queues are getting too large so I can investigate further.

My monitoring tool of choice is overwhelmed by the sheer number of metrics returned because I get data on each and every queue that is active, when I only care about larger queues.

If I can pass an argument to say what the smallest queue size is that I care about, I can significantly narrow down how much data the API returns, without losing any relevant information for my monitoring purposes.

For example, if I pass a minimum size of 1000, any queue holding fewer than 1000 messages would not be returned by the metrics API.
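To make the request concrete, here is a minimal sketch of the filtering behavior being asked for. The payload shape and the `filter_queues` helper are illustrative assumptions, not the real kumod schema; the same logic could also run client-side against the existing metrics output as a stopgap:

```python
# Sketch only: the payload shape below is an assumption for illustration,
# not the actual shape returned by kumod's metrics endpoint.
metrics = {
    "scheduled_count": {
        "value": {
            "small-queue.example.com": 50,
            "large-queue.example.com": 2500,
        }
    }
}

def filter_queues(payload, min_size=1000):
    """Keep only queues whose scheduled_count meets the threshold."""
    counts = payload["scheduled_count"]["value"]
    return {queue: n for queue, n in counts.items() if n >= min_size}

print(filter_queues(metrics))
# only large-queue.example.com remains
```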

wez commented 1 month ago

The situation has more nuance than a simple minimum threshold for queue size: individual metrics don't know that they are associated with a queue, and the metric-exporting logic has no concept of a queue, because metrics and exporting live in a separate crate that is independent from the queuing portion of kumod.

While we can fairly easily apply a minimum threshold to literally just the scheduled_count metric, the exporter cannot easily know that the other half-dozen or so related metrics should also be excluded when scheduled_count is below that threshold. Doing so would introduce quadratic complexity to the exporter: every metric would need to know about its relationship to all the others, then resolve and evaluate the associated queue-size metric from that association.

We could shift the responsibility for exclusion to the client by making them explicitly pass the list of metrics and their thresholds as part of the /metrics GET request, but there are already quite a few variations of metrics and rollups, and that list would immediately become cluttered and difficult to manage.

The way I'm leaning at the moment is that it might be best to leave that sort of filtering logic to the prometheus configuration, as discussed in https://grafana.com/blog/2022/10/20/how-to-manage-high-cardinality-metrics-in-prometheus-and-kubernetes/ and https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/, because the configuration is at least easier to see and understand in the prometheus config file vs. all pushed into a giant HTTP URL.
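For illustration, a Prometheus recording rule along these lines could precompute a series containing only the large queues. The rule and group names here are made up, and the metric name assumes the scheduled_count metric mentioned above:

```yaml
# Hypothetical recording rule: value-based filtering happens in PromQL,
# so only series at or above the threshold appear in the recorded metric.
groups:
  - name: kumomta_queue_filtering
    rules:
      - record: kumomta:scheduled_count:large
        expr: scheduled_count >= 1000
```

Dashboards and alerts would then query kumomta:scheduled_count:large instead of the raw metric, leaving the full-cardinality data available if it is ever needed.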

In discussion with a customer, I got the impression that the prometheus export doesn't really work as well as desired at scale because the cardinality is so high. There are also some operational states around understanding the various causes of throttling that cannot be expressed in the relatively limited numerical form that prometheus supports. What I'm exploring at the moment is a non-prometheus endpoint that can more easily be constrained with thresholds and can also expose textual and timestamp information; for example, we could indicate that the maintainer has reached a connection cap due to hitting a specific provider throttle, along with its name and when that state came into effect.
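To sketch the idea, a response from such an endpoint might look something like the following. Every field name here is hypothetical; this only illustrates the kind of threshold-constrained, textual, and timestamp information described above:

```json
{
  "queues": [
    {
      "name": "gmail.com",
      "scheduled_count": 12500,
      "state": "connection_cap_reached",
      "reason": "provider throttle: example-hourly-limit",
      "since": "2024-01-01T00:00:00Z"
    }
  ]
}
```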