Open pengwyn opened 6 months ago
Could this be an instance of https://github.com/canonical/grafana-agent-operator/issues/41 ? If the system were really that full of VM storage, you'd likely want an alert.
Otherwise, if you're sure you want a higher threshold, you can silence this alert in Alertmanager and add a new one to replace it in cos-config.
This is not like #41, where the node is a nova-compute that is split between user workloads (using hugepages) and management software (using the remaining memory not blocked out by hugepages). In this issue, I'm referring to an infra node, on which many KVMs run different control-plane workloads and no hugepages are configured.
Really, we don't care about the exact % of memory available for the entire 128GB node, but rather just the % for the small amount dedicated to non-kvm usage. For example, we'd rather know if more than 5GB/10GB of memory dedicated to the supervisor parts of the node has been used up.
(Edit: hit enter too soon) I like the option to configure this through cos-config though. I haven't looked into that yet.
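To make the distinction above concrete, here is a rough sketch comparing node-wide usage with usage measured against the small non-KVM budget. The function names and the sample figures (118 GB held by KVMs, 123 GB used in total) are made up for illustration, and how KVM memory would actually be measured is left open.

```python
def node_usage_percent(total_used_gb: float, node_total_gb: float) -> float:
    """Memory used as a percentage of the whole node (what a node-wide rule sees)."""
    return 100 * total_used_gb / node_total_gb


def supervisor_usage_percent(
    total_used_gb: float, kvm_used_gb: float, supervisor_budget_gb: float
) -> float:
    """Non-KVM memory used as a percentage of the budget reserved for it."""
    non_kvm_used_gb = total_used_gb - kvm_used_gb
    return 100 * non_kvm_used_gb / supervisor_budget_gb


# Hypothetical figures: 128 GB node, 10 GB reserved for non-KVM usage,
# KVMs holding 118 GB, and 123 GB used overall.
print(round(node_usage_percent(123, 128), 2))  # 96.09
print(supervisor_usage_percent(123, 118, 10))  # 50.0
```

With these figures the node-wide value is already above 95%, while the supervisor budget is only half used, which is the 5GB/10GB case mentioned above.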
Enhancement Proposal
Currently the rule `HostMemoryFull` is hardcoded for > 95% memory usage of the node. This rule, and the others in `src/prometheus_alert_rules`, would benefit from being configurable, either as:

a) individual options for each rule (e.g. include a `host-memory-full-threshold`)
b) allow for individual overrides (e.g. override the expression for `HostMemoryFull` through a config option which is a dictionary)
c) allow for per-file overrides (e.g. override the entire `memory.rules` file through an attach-resource)

The situation we are facing currently, coming from the nrpe charmed option, is that we'd like to limit the HostMemoryFull to around 5G of memory left. Because the node is large, but hosts many KVMs, this turns out to be a small percentage and constantly triggers the HostMemoryFull alert.
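As a rough illustration of the numbers involved, the sketch below converts an absolute "memory left" floor into the equivalent percent-used threshold and builds an absolute-floor expression. It assumes the 128GB node mentioned in the discussion and treats the sizes as GiB. The helper names are hypothetical, and the expression uses node_exporter's `node_memory_MemAvailable_bytes` gauge, which may not match how the charm's existing rule is written.

```python
GIB = 1024 ** 3


def usage_threshold_percent(total_bytes: int, min_free_bytes: int) -> float:
    """Percent-used threshold equivalent to keeping min_free_bytes available."""
    if not 0 <= min_free_bytes <= total_bytes:
        raise ValueError("min_free_bytes must be between 0 and total_bytes")
    return 100.0 * (1 - min_free_bytes / total_bytes)


def absolute_floor_expr(min_free_bytes: int) -> str:
    """PromQL expression that is true once available memory drops below the floor.

    Assumption: node_memory_MemAvailable_bytes is being scraped from node_exporter.
    """
    return f"node_memory_MemAvailable_bytes < {min_free_bytes}"


total = 128 * GIB  # node size taken from the discussion
floor = 5 * GIB    # desired "memory left" limit

print(round(usage_threshold_percent(total, floor), 2))  # 96.09
print(absolute_floor_expr(floor))
```

On this node a 95% rule fires with roughly 6.4 GiB still free, whereas a 5 GiB floor corresponds to about 96.09% used. An absolute floor also avoids recomputing the percentage for each node size.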