canonical / grafana-agent-k8s-operator

https://charmhub.io/grafana-agent-k8s
Apache License 2.0

Alert rules are injected with incorrect label filters #190

Closed chanchiwai-ray closed 1 year ago

chanchiwai-ray commented 1 year ago

Bug Description

The Grafana agent charm ships default alert rules for Prometheus. However, when the charm is deployed and related to COS, the alert rules are injected with incorrect JujuTopology (juju_*) label filters, which match no time series.

Take, for example, the HostOomKillDetected alert rule. In the Prometheus web UI, its expr was modified to the following:

[Screenshot from 2023-05-12 10-59-21: the modified HostOomKillDetected expr in the Prometheus web UI]

However, in Prometheus, the node_vmstat_oom_kill metric has no time series with those labels.

[Screenshot from 2023-05-12 11-00-33: querying node_vmstat_oom_kill in Prometheus returns no series with the injected labels]

So there is a label mismatch between the metric actually scraped from grafana-agent (via node_exporter) and the label filters in the alert rule.
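
For illustration, the mismatch is roughly of the following shape (the exprs and label values here are reconstructed examples, not the exact ones from the screenshots):

    # Roughly what the injected rule looks like: it filters on juju_charm and on
    # juju_application="grafana-agent", so it can never match.
    increase(node_vmstat_oom_kill{juju_application="grafana-agent", juju_charm="grafana-agent", juju_model="...", juju_model_uuid="..."}[1h]) > 0

    # What the scraped series actually carry: the principal application's topology
    # and no juju_charm label at all.
    node_vmstat_oom_kill{juju_application="ubuntu", juju_model="...", instance="..."}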

To Reproduce

  1. Deploy COS-lite, or use an existing COS-lite deployment
  2. Deploy ubuntu charm and grafana-agent machine charm
  3. Relate ubuntu to grafana-agent; relate grafana-agent to COS-lite
  4. Open the Prometheus web UI, check the alert rules, and inspect the expr

Environment

Built from latest source @ 73543ef892912848c3babf062cbd17f8f7e20730, or latest/edge

Relevant log output

None

Additional context

No response

rgildein commented 1 year ago

As we discussed today, this is not a bug, but a design decision that we want to improve. At the same time, we came to the conclusion that the grafana-agent charm should offer the possibility of adding basic alert rules linked to the principal application (the JujuTopology would then carry information about the principal application, not grafana-agent).

These rules should not be configurable and should have reasonable thresholds so that they fire only in genuinely critical situations. An example is an alert rule for memory usage with a threshold of 95%. Such a threshold is high and therefore generic, so if a principal application provides its own memory alert rule, it will almost certainly use a lower, more application-specific threshold.
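
A minimal sketch of such a baseline rule, assuming standard node_exporter memory metrics; the group name, alert name, for: duration, and exact expr are illustrative, not the charm's actual defaults:

    groups:
      - name: grafana-agent-baseline        # hypothetical group name
        rules:
          - alert: HostMemoryUsageCritical  # hypothetical alert name
            # Generic 95% threshold, as discussed above, computed from the
            # standard node_exporter memory gauges.
            expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 >= 95
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Memory usage on {{ $labels.instance }} is above 95%"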

simskij commented 1 year ago

@rgildein, you are mostly correct. 😅

There is a bug here where the alert rules get the juju_charm label, but the metrics and logs do not. We need to drop this label selector, since with it the rules will never match anything.

rgildein commented 1 year ago

You are right, juju_charm is redundant and should be removed. However, the juju_application label also needs to change: e.g. in Ray's example, juju_application="grafana-agent" should be juju_application="ubuntu".
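
Concretely, the corrected selector would look something like this (label values are illustrative):

    # As injected today: never matches, because the scraped series carry no
    # juju_charm label and a different juju_application.
    node_vmstat_oom_kill{juju_application="grafana-agent", juju_charm="grafana-agent"}

    # After dropping juju_charm and targeting the principal application:
    node_vmstat_oom_kill{juju_application="ubuntu"}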

Also, I had not realized that there is no juju_unit in the topology, so will we need to add by (instance) to all alert rules, or not?

dstathis commented 1 year ago

@rgildein

I don't think there is any need to add juju_unit to alert rules. The rules are specific to the application but should apply to all units in that application.

rgildein commented 1 year ago

> @rgildein
>
> I don't think there is any need to add juju_unit to alert rules. The rules are specific to the application but should apply to all units in that application.

What I meant is that without using by (instance) (e.g. predict_linear(node_memory_MemUsed_percentage[30m], 5 * 60) >= 90), we would get just one alert instead of two when memory is about to reach 90% usage on two machines. But I was wrong.

[Screenshot from 2023-05-30 16-43-02]
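
For reference, and reusing the metric name from the comment above: an unaggregated alert expr already yields one result series per instance, so Prometheus fires a separate alert for each machine that crosses the threshold; only an aggregated expr would collapse them into a single alert.

    # One result series per instance, hence one alert per affected machine:
    predict_linear(node_memory_MemUsed_percentage[30m], 5 * 60) >= 90

    # An aggregated form like this would only ever produce a single alert:
    max(predict_linear(node_memory_MemUsed_percentage[30m], 5 * 60)) >= 90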

dstathis commented 1 year ago

Three different changes across three different repos are required for this.