KumoCorp / kumomta

The first Open-Source high-performance MTA developed from the ground-up for high-volume email sending environments.
https://kumomta.com
Apache License 2.0

Add Minimum Queue Size Threshold to Metrics API #284

Open MHillyer opened 1 month ago

MHillyer commented 1 month ago

As a MailOps administrator I use the metrics API to monitor my servers, looking to see if any queues are getting too large so I can investigate further.

My monitoring tool of choice is overwhelmed by the sheer number of metrics returned because I get data on each and every queue that is active, when I only care about larger queues.

If I can pass an argument to say what the smallest queue size is that I care about, I can significantly narrow down how much data the API returns, without losing any relevant information for my monitoring purposes.

For example, if I pass a minimum size of 1000, any queue holding fewer than 1000 messages would not be returned by the metrics API.
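To make the request concrete, here is a minimal sketch of the filtering behavior being asked for. The payload shape and the `filter_queues` helper are illustrative assumptions, not the real kumod schema; the same logic could also run client-side against the existing metrics output as a stopgap:

```python
# Sketch only: the payload shape below is an assumption for illustration,
# not the actual shape returned by kumod's metrics endpoint.
metrics = {
    "scheduled_count": {
        "value": {
            "small-queue.example.com": 50,
            "large-queue.example.com": 2500,
        }
    }
}

def filter_queues(payload, min_size=1000):
    """Keep only queues whose scheduled_count meets the threshold."""
    counts = payload["scheduled_count"]["value"]
    return {queue: n for queue, n in counts.items() if n >= min_size}

print(filter_queues(metrics))
# only large-queue.example.com remains
```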

wez commented 1 month ago

The situation has more nuance than a simple minimum threshold for queue size: individual metrics don't know that they are associated with a queue, and the metric-exporting logic has no concept of a queue, because metrics and exporting live in a separate crate that is independent from the queuing portion of kumod.

While we can fairly easily apply a minimum threshold to literally just the scheduled_count metric, the exporter cannot easily know that the other half-dozen or so related metrics should also be excluded when scheduled_count is below that threshold. Doing so would introduce quadratic complexity to the exporter: every metric would need to know about its relationship to all the others, then resolve and evaluate the associated queue-size metric from that association.

We could shift the responsibility for exclusion to the client by making them explicitly pass the list of metrics and their thresholds as part of the /metrics GET request, but there are already quite a few variations of metrics and rollups, and that list would immediately become cluttered and difficult to manage.

The way I'm leaning at the moment is that it might be best to leave that sort of filtering logic to the prometheus configuration, as discussed in https://grafana.com/blog/2022/10/20/how-to-manage-high-cardinality-metrics-in-prometheus-and-kubernetes/ and https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/, because the configuration is at least easier to see and understand in the prometheus config file vs. all pushed into a giant HTTP URL.
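For illustration, a Prometheus recording rule along these lines could precompute a series containing only the large queues. The rule and group names here are made up, and the metric name assumes the scheduled_count metric mentioned above:

```yaml
# Hypothetical recording rule: value-based filtering happens in PromQL,
# so only series at or above the threshold appear in the recorded metric.
groups:
  - name: kumomta_queue_filtering
    rules:
      - record: kumomta:scheduled_count:large
        expr: scheduled_count >= 1000
```

Dashboards and alerts would then query kumomta:scheduled_count:large instead of the raw metric, leaving the full-cardinality data available if it is ever needed.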

In discussion with a customer, I got the impression that the prometheus export doesn't really work as well as desired at scale because the cardinality is so high. There are also some operational states around understanding the various causes of throttling that cannot be expressed in the relatively limited numerical form that prometheus supports. What I'm exploring at the moment is a non-prometheus endpoint that can more easily be constrained with thresholds and can also expose textual and timestamp information; for example, we could indicate that the maintainer has reached a connection cap due to hitting a specific provider throttle, along with its name and when that state came into effect.
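To sketch the idea, a response from such an endpoint might look something like the following. Every field name here is hypothetical; this only illustrates the kind of threshold-constrained, textual, and timestamp information described above:

```json
{
  "queues": [
    {
      "name": "gmail.com",
      "scheduled_count": 12500,
      "state": "connection_cap_reached",
      "reason": "provider throttle: example-hourly-limit",
      "since": "2024-01-01T00:00:00Z"
    }
  ]
}
```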