canonical / prometheus-k8s-operator

This charmed operator automates the operational procedures of running Prometheus, an open-source metrics backend.
https://charmhub.io/prometheus-k8s
Apache License 2.0

Changing name of an alert creates a new alert rule #497

Closed: Omgzilla closed this issue 3 weeks ago

Omgzilla commented 1 year ago

Bug Description

We have just started rolling out the COS-lite stack into our production environment. When we build our charms, we add alert rules that grafana-agent forwards to prometheus, but when we change the name of a rule and refresh the application, a new rule gets created alongside the old one instead of replacing it.

We were able to remove the stale rule by commenting it out, pushing that change with juju refresh, then removing the file and recreating the charm.
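
A minimal sketch of that workaround cycle, for clarity. The rule file and application names below are placeholders, and "recreate the charm" is interpreted here as re-packing and refreshing:

# Placeholder names: src/prometheus_alert_rules/my_rule.rule and my-charm
# are stand-ins for illustration, not taken from the report.

# Step 1: comment out the offending rule, rebuild, and refresh so the
# stale alert is dropped.
sed -i 's/^/# /' src/prometheus_alert_rules/my_rule.rule
charmcraft pack
juju refresh my-charm --path ./*.charm

# Step 2: remove the rule file entirely, then recreate (re-pack and
# refresh) the charm with the rule gone.
rm src/prometheus_alert_rules/my_rule.rule
charmcraft pack
juju refresh my-charm --path ./*.charm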

To Reproduce

  1. Include alert rules in a charm
  2. Pack it with charmcraft
  3. Deploy it
  4. Relate the charm -> grafana-agent -> COS-lite stack
  5. Check Grafana for the alert rules
  6. Change the alert name of a rule
  7. Pack it again with charmcraft
  8. juju refresh
  9. Watch Grafana for the new (duplicate) alert rule (a sketch of these last steps follows this list)

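A hedged sketch of steps 6-8 plus a Prometheus-side check (the same kind of check used in the repro attempts further down). The alert names are taken from the diff later in this thread, the application name avalanche is from the repro below, the Prometheus address is a placeholder, and the jq filter assumes the standard /api/v1/rules response shape:

# Step 6: rename the alert in its rule file (names from the diff shown
# later in this thread).
sed -i 's/AlwaysFiringDueToAbsentMetric/AlwaysFiringDueToAbsentMetricRenamed/' \
  src/prometheus_alert_rules/always_firing_absent.rule

# Steps 7-8: rebuild and refresh the deployed application.
charmcraft pack
juju refresh avalanche --path ./*.charm

# Prometheus-side check: list the alert rule names Prometheus ended up
# with; a leftover entry for the old name would confirm the bug.
curl -s <prometheus-ip>:9090/api/v1/rules | jq -r '.data.groups[].rules[].name'
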
Environment

Juju controller: v2.9.43

Cloud: cross-model deployment, using COS-lite and Grafana-Agent (edge)

Relevant log output

Will update this issue when I have reproduced the scenario.

Additional context

No response

PietroPasotti commented 1 year ago

We should stop writing alert rules to persistent storage and instead recalculate everything every time; then this will solve itself. It's a refactoring job.

sed-i commented 1 month ago

Repro attempt with grafana-agent-k8s

graph LR
grafana-agent ---|remote-write| prometheus
avalanche ---|metrics-endpoint| grafana-agent

We have two alerts from avalanche:

$ curl -s 10.1.207.168:9090/api/v1/rules | jq | grep '"av"' -C10 | grep alertname
                  "alertname": "AlwaysFiringDueToAbsentMetric",
                  "alertname": "AlwaysFiringDueToNumericValue",

Rename one:

diff --git a/src/prometheus_alert_rules/always_firing_absent.rule b/src/prometheus_alert_rules/always_firing_absent.rule
index 17f8b01..327197e 100644
--- a/src/prometheus_alert_rules/always_firing_absent.rule
+++ b/src/prometheus_alert_rules/always_firing_absent.rule
@@ -1,4 +1,4 @@
-alert: AlwaysFiringDueToAbsentMetric
+alert: AlwaysFiringDueToAbsentMetricRenamed
 expr: absent(some_metric_name_that_shouldnt_exist{job="non_existing_job"})
 for: 0m
 labels:

Pack, refresh, and the renamed rule is up to date, with no duplication:

 $ curl -s 10.1.207.168:9090/api/v1/rules | jq | grep '"av"' -C10 | grep alertname
                  "alertname": "AlwaysFiringDueToAbsentMetricRenamed",
                  "alertname": "AlwaysFiringDueToNumericValue",

Will try with a machine charm next.

sed-i commented 1 month ago

Repro attempt with grafana-agent

graph LR

subgraph lxd
grafana-agent --- ubuntu
hardware-observer --- ubuntu
grafana-agent --- hardware-observer
end

subgraph k8s
prometheus
end

prometheus --- grafana-agent

We have 78 alerts from hardware observer:

$ juju ssh --container prometheus prom/0 cat /etc/prometheus/rules/juju_welcome-lxd_82889f2e_hwo.rules | grep -c "alert:"
78

Then rename:

diff --git a/src/prometheus_alert_rules/ipmi_sensors.yaml b/src/prometheus_alert_rules/ipmi_sensors.yaml
index b83af12..82423f9 100644
--- a/src/prometheus_alert_rules/ipmi_sensors.yaml
+++ b/src/prometheus_alert_rules/ipmi_sensors.yaml
@@ -2,7 +2,7 @@ groups:
 - name: IpmiSensors
   rules:

-    - alert: IPMIMonitoringCommandFailed
+    - alert: IPMIMonitoringCommandFailedRenamed
       expr: ipmimonitoring_command_success == 0
       for: 5m
       labels:

Pack and refresh: the count is still 78, and the "...Renamed" rule is there.
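
For completeness, a hedged sketch of how that can be re-checked, reusing the juju ssh command from above (the unit name and rules path are from the earlier output):

# Rule count should still be 78 after the rename.
juju ssh --container prometheus prom/0 \
  cat /etc/prometheus/rules/juju_welcome-lxd_82889f2e_hwo.rules | grep -c "alert:"

# Only the renamed alert should match; the old name should be gone.
juju ssh --container prometheus prom/0 \
  cat /etc/prometheus/rules/juju_welcome-lxd_82889f2e_hwo.rules \
  | grep -E "IPMIMonitoringCommandFailed(Renamed)?"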

sed-i commented 1 month ago

@Omgzilla I failed to reproduce this. Would you be able to paste the output of juju export-bundle from the lxd model if you encounter this again?
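
For reference, something along these lines should do it (the model name is a placeholder for whichever lxd model holds grafana-agent and the charm with the renamed rule):

# Placeholder model name; substitute your own lxd model.
juju export-bundle -m <lxd-model> > bundle.yaml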

sed-i commented 3 weeks ago

Closing for now, but please do re-open if encountered again!