canonical / grafana-agent-operator

This charmed operator automates the operational procedures of running Grafana Agent, an open-source telemetry collector.
https://charmhub.io/grafana-agent
Apache License 2.0

Allow config option to configure alert rules for memory #45

Open · pengwyn opened this issue 6 months ago

pengwyn commented 6 months ago

Enhancement Proposal

Currently the HostMemoryFull rule is hardcoded to fire at > 95% memory usage of the node. This rule, and the others in src/prometheus_alert_rules, would benefit from being configurable, for example as one of the following (a sketch of the current hardcoded rule follows the list):

a) individual options for each rule (e.g. a host-memory-full-threshold config option)
b) individual overrides (e.g. override the expression for HostMemoryFull through a config option holding a dictionary)
c) per-file overrides (e.g. replace the entire memory.rules file through an attached resource)
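For reference, the current hardcoded rule presumably looks something like this node-exporter-based sketch; the exact expression, duration, and labels in src/prometheus_alert_rules may differ:

```yaml
# Sketch only: approximates the shape of the shipped rule, not a verbatim copy.
groups:
  - name: memory
    rules:
      - alert: HostMemoryFull
        # Fires when more than 95% of the node's memory is in use.
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Host memory usage above 95% ({{ $labels.instance }})
```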

The situation we are currently facing, coming from the nrpe charm, is that we'd like HostMemoryFull to fire only when around 5 GB of memory is left. Because the node is large but hosts many KVMs, the memory left free in normal operation is a small percentage of the total, so the HostMemoryFull alert triggers constantly.
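An absolute-threshold variant of the rule, as a hedged sketch (5e9 encodes the ~5 GB mentioned above; duration and labels are placeholders):

```yaml
- alert: HostMemoryFull
  # Fires on absolute free memory rather than a percentage:
  # less than ~5 GB available, regardless of total node size.
  expr: node_memory_MemAvailable_bytes < 5e9
  for: 5m
  labels:
    severity: warning
```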

PietroPasotti commented 5 months ago

Could this be an instance of https://github.com/canonical/grafana-agent-operator/issues/41? If the system really were that full because of VM memory, you'd likely want an alert.

Otherwise, if you're sure you want a higher threshold, you can silence this alert in Alertmanager and add a replacement rule via cos-config.
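As a rough sketch of that route, assuming the cos-configuration charm's git-sync workflow (the file path, group, and rule name here are illustrative): the built-in HostMemoryFull stays silenced in Alertmanager, and a differently named replacement ships from the synced repository.

```yaml
# Illustrative file in the repository cos-configuration syncs rules from,
# e.g. prometheus_alert_rules/memory_override.rules (path is an assumption).
groups:
  - name: memory-override
    rules:
      - alert: HostMemoryLow   # new name, so the Alertmanager silence does not catch it
        # Reuses the absolute threshold sketched above (~5 GB free).
        expr: node_memory_MemAvailable_bytes < 5e9
        for: 5m
        labels:
          severity: warning
```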

pengwyn commented 5 months ago

This is not like #41, where the node is a nova-compute split between user workloads (using hugepages) and management software (using the memory not blocked out by hugepages). In this issue I'm referring to an infra node, on which many KVMs run different control-plane workloads and no hugepages are configured.

Really, we don't care about the exact percentage of memory available on the entire 128 GB node, but only about the small slice dedicated to non-KVM usage. For example, we'd rather know if more than 5 GB/10 GB of the memory dedicated to the supervisor parts of the node has been used up.
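Since the KVM allocation on such a node is roughly fixed, one hedged way to express this is to subtract that allocation out and alert on what the supervisor side has consumed; the 118e9 figure below is purely illustrative for a 128 GB node, and ballooning/hugepage effects are ignored:

```yaml
- alert: SupervisorMemoryHigh   # hypothetical rule name
  # Used memory minus an assumed fixed KVM allocation (~118 GB) approximates
  # supervisor-side usage; alert when that exceeds ~5 GB.
  expr: >
    (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) - 118e9 > 5e9
```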

(Edit: hit enter too soon.) I do like the option to configure this through cos-config, though; I haven't looked into that yet.