Closed — chanchiwai-ray closed this issue 1 year ago
As we discussed today, this is not a bug but a design decision we want to improve. At the same time, we concluded that the grafana-agent charm should offer the possibility of adding basic alert rules that are linked with the principal application (JujuTopology is provided with information about the principal application, not grafana-agent).
These rules should not be configurable and should have reasonable thresholds so that they fire only in genuinely critical situations. An example is an alert rule for memory usage with a threshold of 95%. Such a threshold is high and therefore generic, so if any principal application provides its own memory alert rule, it will almost certainly use a lower, more application-specific threshold.
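A generic rule of that kind could look like the following sketch (metric names, threshold, and durations are illustrative, not the charm's actual rules):

```yaml
groups:
  - name: generic-host-alerts
    rules:
      - alert: HostMemoryUsageCritical
        # Standard node_exporter metrics; the 95% threshold is deliberately
        # high so that a principal application's own, lower-threshold rule
        # fires first and remains the more specific signal.
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 >= 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Memory usage above 95% on {{ $labels.instance }}"
```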
@rgildein, you are mostly correct. 😅
There is a bug here where the alert rules get the `juju_charm` label, but the metrics and logs do not. We need to drop this label selector, as it means the rules will never match anything.
You are right that `juju_charm` is redundant and should be removed. However, you also need to change the labels for `juju_application`: e.g. in Ray's example, `juju_application=grafana-agent` should be `juju_application=ubuntu`.
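Applied to the example above, the fix to a rule's label selectors would look roughly like this (metric name and expression shape are hypothetical, shown only to illustrate the label change):

```yaml
# Before (hypothetical): the subordinate's topology is injected, so the
# selector never matches any scraped series.
- alert: HostMemoryHigh
  expr: node_memory_MemUsed_percentage{juju_application="grafana-agent", juju_charm="grafana-agent"} >= 95

# After: juju_charm dropped, juju_application pointing at the principal.
- alert: HostMemoryHigh
  expr: node_memory_MemUsed_percentage{juju_application="ubuntu"} >= 95
```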
Also, I did not realize that there is no `juju_unit` in the topology, so will we need to add `by (instance)` to all alert rules, or not?
@rgildein I don't think there is any need to add `juju_unit` to alert rules. The rules are specific to the application but should apply to all units in that application.
> @rgildein I don't think there is any need to add `juju_unit` to alert rules. The rules are specific to the application but should apply to all units in that application.
What I meant is that without using `by (instance)` (e.g. `predict_linear(node_memory_MemUsed_percentage[30m], 5 * 60) >= 90`), we would get just one alert instead of two when memory is about to reach 90% usage on two machines. But I was wrong.
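For reference, the reason `by (instance)` is unnecessary here: Prometheus creates one alert per matching time series, so a plain expression already yields a separate alert for each machine. A `by (...)` clause only matters when the expression aggregates across series first. A sketch, reusing the metric name from the example above:

```yaml
# One alert per series: if two machines trend toward 90%, two alerts fire,
# each carrying its own `instance` label.
- alert: MemoryUsagePredictedHigh
  expr: predict_linear(node_memory_MemUsed_percentage[30m], 5 * 60) >= 90
  # `by (instance)` would only be needed if the expression aggregated
  # across series first, e.g.:
  #   max by (instance) (predict_linear(node_memory_MemUsed_percentage[30m], 5 * 60)) >= 90
```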
Three different changes across three different repos are required for this.
Bug Description
The grafana-agent charm has default alert rules for Prometheus. However, when the charm is deployed and related to COS, the alert rules are injected with incorrect JujuTopology (`juju_*`) labels, which match no timeseries. Take, for example, the `HostOomKillDetected` alert rule. In the Prometheus web UI, its `expr` was modified to the following:
However, in Prometheus, the `node_vmstat_oom_kill` metric does not have any timeseries with those labels. So there is a label mismatch between the actual metric from grafana-agent (via node_exporter) and the one in the alert rule.
To Reproduce
expr
Environment
Built from latest source @ 73543ef892912848c3babf062cbd17f8f7e20730, or
latest/edge
Relevant log output
Additional context
No response