Closed: samuelallan72 closed this issue 9 months ago.
Just to give a complete picture about the relation data and the generated prometheus configs.
Hello,
Just to clarify, is ceph-mon a machine or k8s charm?
I see how this doesn't work. Prometheus was not designed to relate to the same application twice. I'm not sure I understand why you need grafana-agent and prometheus-scrape. I think you can use just grafana-agent to get the same metrics.
@dstathis
> Just to clarify, is ceph-mon a machine or k8s charm?
A machine charm.
> I think you can use just grafana-agent to get the same metrics.
We need both: grafana-agent for the host metrics, and ceph-mon for Ceph-specific metrics. Also, ceph-mon provides alert rules for prometheus which are required as well.
Also, please note that it's not relating to the same application twice: one relation is with grafana-agent, the other with ceph-mon (or, in this specific case, prometheus-scrape-config).
I'm not an expert on it, but it looks like there would be an easy way to distinguish those two relations by adding more suffixes to the file name, such as the `relation-id`, `related-endpoint`, or `related-units`, so it can be unique.
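To illustrate the idea, here is a minimal sketch (the helper name and topology fields are illustrative, not the charm library's API): composing the rule-file identifier from the Juju topology plus the relation endpoint name and relation id makes it unique per relation.

```python
# Hypothetical sketch: make the rule-file identifier unique per relation
# by suffixing the relation endpoint name and relation id.
def unique_identifier(model: str, model_uuid: str, app: str,
                      relation_name: str, relation_id: int) -> str:
    """Identifier that stays unique even when two relations
    eventually scrape the same application."""
    base = f"{model}_{model_uuid[:8]}_{app}"        # identifier as derived today
    return f"{base}_{relation_name}_{relation_id}"  # proposed relation suffix

print(unique_identifier("ceph", "57cd5e92-0000-0000", "ceph-mon",
                        "metrics-endpoint", 30))
# ceph_57cd5e92_ceph-mon_metrics-endpoint_30
```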
```yaml
- relation-id: 30
  endpoint: metrics-endpoint
  related-endpoint: metrics-endpoint
  application-data:
    alert_rules:
      ...
  related-units:
    prometheus-scrape-config-ceph/0:
      in-scope: true
      data:
        egress-subnets: 10.152.183.146/32
        ingress-address: 10.152.183.146
        private-address: 10.152.183.146
```
The context: `juju status`
[cos model]
Model Controller Cloud/Region Version SLA Timestamp
cos maas-controller cos-microk8s/localhost 3.1.6 unsupported 12:23:32Z
App Version Status Scale Charm Channel Rev Address Exposed Message
alertmanager 0.25.0 active 1 alertmanager-k8s edge 96 10.152.183.159 no
catalogue active 1 catalogue-k8s edge 31 10.152.183.220 no
cos-configuration-ceph 3.5.0 active 1 cos-configuration-k8s latest/edge 39 10.152.183.129 no
grafana 9.2.1 active 1 grafana-k8s edge 93 10.152.183.132 no
loki 2.7.4 active 1 loki-k8s edge 104 10.152.183.143 no
prometheus 2.47.2 active 1 prometheus-k8s edge 156 10.152.183.140 no
prometheus-scrape-config-ceph n/a active 1 prometheus-scrape-config-k8s latest/edge 44 10.152.183.146 no
traefik 2.10.4 active 1 traefik-k8s edge 164 192.168.151.81 no
Unit Workload Agent Address Ports Message
alertmanager/0* active idle 10.1.54.84
catalogue/0* active idle 10.1.54.72
cos-configuration-ceph/0* active idle 10.1.54.74
grafana/0* active idle 10.1.54.86
loki/0* active idle 10.1.54.87
prometheus-scrape-config-ceph/0* active idle 10.1.54.80
prometheus/0* active idle 10.1.54.88
traefik/0* active idle 10.1.54.85
Offer Application Charm Rev Connected Endpoint Interface Role
alertmanager-karma-dashboard alertmanager alertmanager-k8s 96 0/0 karma-dashboard karma_dashboard provider
grafana-dashboards grafana grafana-k8s 93 3/3 grafana-dashboard grafana_dashboard requirer
loki-logging loki loki-k8s 104 2/2 logging loki_push_api provider
prometheus-receive-remote-write prometheus prometheus-k8s 156 3/3 receive-remote-write prometheus_remote_write provider
prometheus-scrape prometheus prometheus-k8s 156 0/0 metrics-endpoint prometheus_scrape requirer
prometheus-scrape-config-ceph prometheus-scrape-config-ceph prometheus-scrape-config-k8s 44 1/1 configurable-scrape-jobs prometheus_scrape requirer
Integration provider Requirer Interface Type Message
alertmanager:alerting loki:alertmanager alertmanager_dispatch regular
alertmanager:alerting prometheus:alertmanager alertmanager_dispatch regular
alertmanager:grafana-dashboard grafana:grafana-dashboard grafana_dashboard regular
alertmanager:grafana-source grafana:grafana-source grafana_datasource regular
alertmanager:replicas alertmanager:replicas alertmanager_replica peer
alertmanager:self-metrics-endpoint prometheus:metrics-endpoint prometheus_scrape regular
catalogue:catalogue alertmanager:catalogue catalogue regular
catalogue:catalogue grafana:catalogue catalogue regular
catalogue:catalogue prometheus:catalogue catalogue regular
catalogue:replicas catalogue:replicas catalogue_replica peer
cos-configuration-ceph:grafana-dashboards grafana:grafana-dashboard grafana_dashboard regular
cos-configuration-ceph:prometheus-config prometheus:metrics-endpoint prometheus_scrape regular
cos-configuration-ceph:replicas cos-configuration-ceph:replicas cos_configuration_replica peer
grafana:grafana grafana:grafana grafana_peers peer
grafana:metrics-endpoint prometheus:metrics-endpoint prometheus_scrape regular
grafana:replicas grafana:replicas grafana_replicas peer
loki:grafana-dashboard grafana:grafana-dashboard grafana_dashboard regular
loki:grafana-source grafana:grafana-source grafana_datasource regular
loki:metrics-endpoint prometheus:metrics-endpoint prometheus_scrape regular
loki:replicas loki:replicas loki_replica peer
prometheus-scrape-config-ceph:metrics-endpoint prometheus:metrics-endpoint prometheus_scrape regular
prometheus:grafana-dashboard grafana:grafana-dashboard grafana_dashboard regular
prometheus:grafana-source grafana:grafana-source grafana_datasource regular
prometheus:prometheus-peers prometheus:prometheus-peers prometheus_peers peer
traefik:ingress alertmanager:ingress ingress regular
traefik:ingress catalogue:ingress ingress regular
traefik:ingress-per-unit loki:ingress ingress_per_unit regular
traefik:ingress-per-unit prometheus:ingress ingress_per_unit regular
traefik:metrics-endpoint prometheus:metrics-endpoint prometheus_scrape regular
traefik:peers traefik:peers traefik_peers peer
traefik:traefik-route grafana:ingress traefik_route regular
[ceph model]
Model Controller Cloud/Region Version SLA Timestamp
ceph maas-controller maas/default 3.1.6 unsupported 12:24:01Z
SAAS Status Store URL
cos-alertmanager-karma-dashboard active maas-controller admin/cos.alertmanager-karma-dashboard
cos-grafana-dashboards active maas-controller admin/cos.grafana-dashboards
cos-loki-logging active maas-controller admin/cos.loki-logging
cos-prometheus-receive-remote-write active maas-controller admin/cos.prometheus-receive-remote-write
cos-prometheus-scrape active maas-controller admin/cos.prometheus-scrape
cos-prometheus-scrape-config-ceph active maas-controller admin/cos.prometheus-scrape-config-ceph
App Version Status Scale Charm Channel Rev Exposed Message
ceph-dashboard active 3 ceph-dashboard quincy/stable 48 no Unit is ready
ceph-fs 17.2.6 active 1 ceph-fs quincy/stable 60 no Unit is ready
ceph-iscsi active 2 ceph-iscsi quincy/stable 27 no Unit is ready
ceph-loadbalancer active 1 openstack-loadbalancer jammy/stable 10 no Unit is ready
ceph-mon 17.2.6 active 3 ceph-mon quincy/stable 194 no Unit is ready and clustered
ceph-nfs active 1 ceph-nfs quincy/stable 8 no Unit is ready
ceph-osd 17.2.6 active 3 ceph-osd quincy/stable 576 no Unit is ready (2 OSD)
ceph-radosgw 17.2.6 active 1 ceph-radosgw quincy/stable 564 no Unit is ready
grafana-agent active 11 grafana-agent latest/edge 20 no
ntp 4.2 active 3 ntp stable 50 no chrony: Ready
vault 1.8.8 active 1 vault 1.8/stable 183 no Unit is ready (active: true, mlock: disabled)
Unit Workload Agent Machine Public address Ports Message
ceph-fs/0* active idle 0/lxd/0 192.168.151.108 Unit is ready
grafana-agent/12 active idle 192.168.151.108
ceph-iscsi/0* active idle 1 192.168.151.103 Unit is ready
ceph-iscsi/1 active idle 2 192.168.151.104 Unit is ready
ceph-loadbalancer/0* active idle 1/lxd/0 192.168.151.107 Unit is ready
grafana-agent/14 active idle 192.168.151.107
ceph-mon/0 active idle 0/lxd/1 192.168.151.112 Unit is ready and clustered
ceph-dashboard/2 active idle 192.168.151.112 Unit is ready
grafana-agent/13 active idle 192.168.151.112
ceph-mon/1 active idle 1/lxd/1 192.168.151.110 Unit is ready and clustered
ceph-dashboard/0* active idle 192.168.151.110 Unit is ready
grafana-agent/9 active idle 192.168.151.110
ceph-mon/2* active idle 2/lxd/0 192.168.151.109 Unit is ready and clustered
ceph-dashboard/1 active idle 192.168.151.109 Unit is ready
grafana-agent/7 active idle 192.168.151.109
ceph-nfs/0* active idle 1/lxd/2 192.168.151.111 Unit is ready
grafana-agent/15 active idle 192.168.151.111
ceph-osd/0* active idle 0 192.168.151.102 Unit is ready (2 OSD)
grafana-agent/2* active idle 192.168.151.102
ntp/1 active idle 192.168.151.102 123/udp chrony: Ready
ceph-osd/1 active idle 1 192.168.151.103 Unit is ready (2 OSD)
grafana-agent/1 active idle 192.168.151.103
ntp/2 active idle 192.168.151.103 123/udp chrony: Ready
ceph-osd/2 active idle 2 192.168.151.104 Unit is ready (2 OSD)
grafana-agent/6 active idle 192.168.151.104
ntp/0* active idle 192.168.151.104 123/udp chrony: Ready
ceph-radosgw/0* active idle 0/lxd/2 192.168.151.113 80/tcp Unit is ready
grafana-agent/8 active idle 192.168.151.113
vault/0* active idle 2/lxd/1 192.168.151.106 8200/tcp Unit is ready (active: true, mlock: disabled)
grafana-agent/16 active idle 192.168.151.106
Machine State Address Inst id Base AZ Message
0 started 192.168.151.102 wise-fowl ubuntu@22.04 default Deployed
0/lxd/0 started 192.168.151.108 juju-194608-0-lxd-0 ubuntu@22.04 default Container started
0/lxd/1 started 192.168.151.112 juju-194608-0-lxd-1 ubuntu@22.04 default Container started
0/lxd/2 started 192.168.151.113 juju-194608-0-lxd-2 ubuntu@22.04 default Container started
1 started 192.168.151.103 key-orca ubuntu@22.04 default Deployed
1/lxd/0 started 192.168.151.107 juju-194608-1-lxd-0 ubuntu@22.04 default Container started
1/lxd/1 started 192.168.151.110 juju-194608-1-lxd-1 ubuntu@22.04 default Container started
1/lxd/2 started 192.168.151.111 juju-194608-1-lxd-2 ubuntu@22.04 default Container started
2 started 192.168.151.104 stable-roughy ubuntu@22.04 default Deployed
2/lxd/0 started 192.168.151.109 juju-194608-2-lxd-0 ubuntu@22.04 default Container started
2/lxd/1 started 192.168.151.106 juju-194608-2-lxd-1 ubuntu@22.04 default Container started
Integration provider Requirer Interface Type Message
ceph-fs:juju-info grafana-agent:juju-info juju-info subordinate
ceph-iscsi:admin-access ceph-dashboard:iscsi-dashboard ceph-iscsi-admin-access regular
ceph-iscsi:cluster ceph-iscsi:cluster ceph-iscsi-peer peer
ceph-loadbalancer:juju-info grafana-agent:juju-info juju-info subordinate
ceph-loadbalancer:loadbalancer ceph-dashboard:loadbalancer openstack-loadbalancer regular
ceph-mon:client ceph-iscsi:ceph-client ceph-client regular
ceph-mon:client ceph-nfs:ceph-client ceph-client regular
ceph-mon:dashboard ceph-dashboard:dashboard ceph-dashboard subordinate
ceph-mon:juju-info grafana-agent:juju-info juju-info subordinate
ceph-mon:mds ceph-fs:ceph-mds ceph-mds regular
ceph-mon:metrics-endpoint cos-prometheus-scrape-config-ceph:configurable-scrape-jobs prometheus_scrape regular
ceph-mon:mon ceph-mon:mon ceph peer
ceph-mon:osd ceph-osd:mon ceph-osd regular
ceph-mon:radosgw ceph-radosgw:mon ceph-radosgw regular
ceph-nfs:cluster ceph-nfs:cluster ceph-nfs-peer peer
ceph-nfs:juju-info grafana-agent:juju-info juju-info subordinate
ceph-osd:juju-info grafana-agent:juju-info juju-info subordinate
ceph-osd:juju-info ntp:juju-info juju-info subordinate
ceph-radosgw:cluster ceph-radosgw:cluster swift-ha peer
ceph-radosgw:juju-info grafana-agent:juju-info juju-info subordinate
cos-loki-logging:logging grafana-agent:logging-consumer loki_push_api regular
cos-prometheus-receive-remote-write:receive-remote-write grafana-agent:send-remote-write prometheus_remote_write regular
grafana-agent:grafana-dashboards-provider cos-grafana-dashboards:grafana-dashboard grafana_dashboard regular
grafana-agent:peers grafana-agent:peers grafana_agent_replica peer
ntp:ntp-peers ntp:ntp-peers ntp peer
vault:certificates ceph-dashboard:certificates tls-certificates regular
vault:certificates ceph-iscsi:certificates tls-certificates regular
vault:certificates ceph-radosgw:certificates tls-certificates regular
vault:cluster vault:cluster vault-ha peer
vault:juju-info grafana-agent:juju-info juju-info subordinate
prometheus-k8s-operator (unique_identifiers=)$ git rev-parse HEAD
cd1615f21c3e38c6fc56356b95330186b0de5a32
$ charmcraft pack
$ juju refresh prometheus --path ./prometheus-k8s_ubuntu-20.04-amd64.charm
I only see one file per application, although I expect to see one rule file for the grafana-agent host metrics plus another rule file for the specific service.
# ls -l /etc/prometheus/rules/
total 280
-rw-r--r-- 1 root root 55997 Nov 22 15:31 juju_ceph_57cd5e92_ceph-mon_metrics-endpoint_30.rules
-rw-r--r-- 1 root root 26899 Nov 22 15:31 juju_ceph_57cd5e92_ceph-osd.rules
-rw-r--r-- 1 root root 27715 Nov 22 15:31 juju_controller_1f525a2f_controller.rules
-rw-r--r-- 1 root root 151705 Nov 22 15:31 juju_cos-microk8s_c4865096_microk8s.rules
-rw-r--r-- 1 root root 2552 Nov 22 15:31 juju_cos_344a17da_alertmanager_metrics-endpoint_18.rules
-rw-r--r-- 1 root root 1416 Nov 22 15:31 juju_cos_344a17da_grafana_metrics-endpoint_20.rules
-rw-r--r-- 1 root root 3135 Nov 22 15:31 juju_cos_344a17da_loki_metrics-endpoint_19.rules
-rw-r--r-- 1 root root 1327 Nov 22 15:31 juju_cos_344a17da_traefik_metrics-endpoint_17.rules
Can I see `juju status --relations`? I am a bit confused how this could have not worked. Could you also post `juju debug-log`?
> Can I see `juju status --relations`? I am a bit confused how this could have not worked.
Juju status is in https://github.com/canonical/prometheus-k8s-operator/issues/551#issuecomment-1822676148
I'm confused. `juju_ceph_57cd5e92_ceph-osd.rules` doesn't follow the new syntax `identifier = f"{identifier}_{relation.name}_{relation.id}"`.
Are the alert_rules from grafana-agent handled in a different part of the code?
I did some quick verbose logging and it didn't capture anything around ceph-osd.
$ git diff

```diff
diff --git a/lib/charms/prometheus_k8s/v0/prometheus_scrape.py b/lib/charms/prometheus_k8s/v0/prometheus_scrape.py
index 6392737..1530f73 100644
--- a/lib/charms/prometheus_k8s/v0/prometheus_scrape.py
+++ b/lib/charms/prometheus_k8s/v0/prometheus_scrape.py
@@ -1000,16 +1000,21 @@ class MetricsEndpointConsumer(Object):
         """
         alerts = {}  # type: Dict[str, dict] # mapping b/w juju identifiers and alert rule files
         for relation in self._charm.model.relations[self._relation_name]:
+            logger.warn(f"{relation.units = }")
+            logger.warn(f"{relation.app = }")
             if not relation.units or not relation.app:
                 continue

             alert_rules = json.loads(relation.data[relation.app].get("alert_rules", "{}"))
             if not alert_rules:
+                logger.warn("not alert_rules")
                 continue

             alert_rules = self._inject_alert_expr_labels(alert_rules)

             identifier, topology = self._get_identifier_by_alert_rules(alert_rules)
+            logger.warn(f"{identifier = }")
+            logger.warn(f"{topology = }")
             if not topology:
                 try:
                     scrape_metadata = json.loads(relation.data[relation.app]["scrape_metadata"])
@@ -1032,13 +1037,16 @@ class MetricsEndpointConsumer(Object):
             # We need to append the relation info to the identifier. This is to allow for cases for there are two
             # relations which eventually scrape the same application. Issue #551.
             identifier = f"{identifier}_{relation.name}_{relation.id}"
+            logger.warn(f"{identifier = }")
             alerts[identifier] = alert_rules

             _, errmsg = self._tool.validate_alert_rules(alert_rules)
+            logger.warn(f"{errmsg = }")
             if errmsg:
                 if alerts[identifier]:
                     del alerts[identifier]
+                    logger.warn("del alerts[identifier]")
                 if self._charm.unit.is_leader():
                     data = json.loads(relation.data[self._charm.app].get("event", "{}"))
                     data["errors"] = errmsg
```
unit-prometheus-0: 15:48:55 INFO unit.prometheus/0.juju-log Pushed new configuration
unit-prometheus-0: 15:48:55 WARNING unit.prometheus/0.juju-log relation.units = {<ops.model.Unit traefik/0>}
unit-prometheus-0: 15:48:55 WARNING unit.prometheus/0.juju-log relation.app = <ops.model.Application traefik>
unit-prometheus-0: 15:48:55 WARNING unit.prometheus/0.juju-log identifier = 'cos_344a17da_traefik'
unit-prometheus-0: 15:48:55 WARNING unit.prometheus/0.juju-log topology = <cosl.juju_topology.JujuTopology object at 0x7f389b5c3610>
unit-prometheus-0: 15:48:55 WARNING unit.prometheus/0.juju-log identifier = 'cos_344a17da_traefik_metrics-endpoint_17'
unit-prometheus-0: 15:48:55 WARNING unit.prometheus/0.juju-log errmsg = ''
unit-prometheus-0: 15:48:55 WARNING unit.prometheus/0.juju-log relation.units = {<ops.model.Unit alertmanager/0>}
unit-prometheus-0: 15:48:55 WARNING unit.prometheus/0.juju-log relation.app = <ops.model.Application alertmanager>
unit-prometheus-0: 15:48:55 WARNING unit.prometheus/0.juju-log identifier = 'cos_344a17da_alertmanager'
unit-prometheus-0: 15:48:55 WARNING unit.prometheus/0.juju-log topology = <cosl.juju_topology.JujuTopology object at 0x7f389b5c3580>
unit-prometheus-0: 15:48:55 WARNING unit.prometheus/0.juju-log identifier = 'cos_344a17da_alertmanager_metrics-endpoint_18'
unit-prometheus-0: 15:48:55 WARNING unit.prometheus/0.juju-log errmsg = ''
unit-prometheus-0: 15:48:55 WARNING unit.prometheus/0.juju-log relation.units = {<ops.model.Unit loki/0>}
unit-prometheus-0: 15:48:55 WARNING unit.prometheus/0.juju-log relation.app = <ops.model.Application loki>
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log identifier = 'cos_344a17da_loki'
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log topology = <cosl.juju_topology.JujuTopology object at 0x7f389b5c3e20>
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log identifier = 'cos_344a17da_loki_metrics-endpoint_19'
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log errmsg = ''
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log relation.units = {<ops.model.Unit grafana/0>}
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log relation.app = <ops.model.Application grafana>
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log identifier = 'cos_344a17da_grafana'
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log topology = <cosl.juju_topology.JujuTopology object at 0x7f389b5c3cd0>
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log identifier = 'cos_344a17da_grafana_metrics-endpoint_20'
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log errmsg = ''
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log relation.units = {<ops.model.Unit cos-configuration-ceph/0>}
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log relation.app = <ops.model.Application cos-configuration-ceph>
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log not alert_rules
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log relation.units = {<ops.model.Unit prometheus-scrape-config-ceph/0>}
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log relation.app = <ops.model.Application prometheus-scrape-config-ceph>
unit-prometheus-0: 15:48:57 WARNING unit.prometheus/0.juju-log identifier = 'ceph_57cd5e92_ceph-mon'
unit-prometheus-0: 15:48:57 WARNING unit.prometheus/0.juju-log topology = <cosl.juju_topology.JujuTopology object at 0x7f389b5c3340>
unit-prometheus-0: 15:48:57 WARNING unit.prometheus/0.juju-log identifier = 'ceph_57cd5e92_ceph-mon_metrics-endpoint_30'
unit-prometheus-0: 15:48:57 WARNING unit.prometheus/0.juju-log errmsg = ''
unit-prometheus-0: 15:49:02 DEBUG unit.prometheus/0.juju-log Updated alert rules file juju_cos_344a17da_traefik_metrics-endpoint_17.rules
unit-prometheus-0: 15:49:02 DEBUG unit.prometheus/0.juju-log Updated alert rules file juju_cos_344a17da_alertmanager_metrics-endpoint_18.rules
unit-prometheus-0: 15:49:02 DEBUG unit.prometheus/0.juju-log Updated alert rules file juju_cos_344a17da_loki_metrics-endpoint_19.rules
unit-prometheus-0: 15:49:02 DEBUG unit.prometheus/0.juju-log Updated alert rules file juju_cos_344a17da_grafana_metrics-endpoint_20.rules
unit-prometheus-0: 15:49:02 DEBUG unit.prometheus/0.juju-log Updated alert rules file juju_ceph_57cd5e92_ceph-mon_metrics-endpoint_30.rules
unit-prometheus-0: 15:49:02 DEBUG unit.prometheus/0.juju-log Updated alert rules file juju_controller_1f525a2f_controller.rules
unit-prometheus-0: 15:49:02 DEBUG unit.prometheus/0.juju-log Updated alert rules file juju_ceph_57cd5e92_ceph-osd.rules
unit-prometheus-0: 15:49:02 DEBUG unit.prometheus/0.juju-log Updated alert rules file juju_cos-microk8s_c4865096_microk8s.rules
unit-prometheus-0: 15:49:02 DEBUG unit.prometheus/0.juju-log Building pebble layer
Why only the ceph-osd alert_rules were created via grafana-agent can be explained by https://github.com/canonical/grafana-agent-operator/issues/17, so it's a separate issue.
ceph-osd/0* active idle 0 192.168.151.102 Unit is ready (2 OSD)
grafana-agent/2* active idle 192.168.151.102
tl;dr: all good with the proposed patch as of cd1615f21c3e38c6fc56356b95330186b0de5a32.
Okay, the testing of the proposed patch was sidetracked by https://github.com/canonical/grafana-agent-operator/issues/17, so I redeployed my test bed. I now have the following, with a dedicated `grafana-agent-ceph-mon` application, to properly reproduce the original issue.
Unit Workload Agent Machine Public address Ports Message
ceph-mon/0 active idle 0/lxd/1 192.168.151.109 Unit is ready and clustered
ceph-dashboard/2 active idle 192.168.151.109 Unit is ready
grafana-agent-ceph-mon/1 active idle 192.168.151.109
ceph-mon/1 active idle 1/lxd/1 192.168.151.111 Unit is ready and clustered
ceph-dashboard/1 active idle 192.168.151.111 Unit is ready
grafana-agent-ceph-mon/2 active idle 192.168.151.111
ceph-mon/2* active idle 2/lxd/0 192.168.151.108 Unit is ready and clustered
ceph-dashboard/0* active idle 192.168.151.108 Unit is ready
grafana-agent-ceph-mon/0* active idle 192.168.151.108
[before patching]
root@prometheus-0:/# ls -1 /etc/prometheus/rules/
juju_ceph_456c92bf_ceph-mon.rules
juju_ceph_456c92bf_ceph-osd.rules
juju_controller_ef2e4e1a_controller.rules
juju_cos-microk8s_59d81459_microk8s.rules
juju_cos_41fa8883_alertmanager.rules
juju_cos_41fa8883_grafana.rules
juju_cos_41fa8883_loki.rules
juju_cos_41fa8883_traefik.rules
[after patching]
root@prometheus-0:/# ls -1 /etc/prometheus/rules/
juju_ceph_456c92bf_ceph-mon.rules
juju_ceph_456c92bf_ceph-mon_metrics-endpoint_30.rules
juju_ceph_456c92bf_ceph-osd.rules
juju_controller_ef2e4e1a_controller.rules
juju_cos-microk8s_59d81459_microk8s.rules
juju_cos_41fa8883_alertmanager_metrics-endpoint_18.rules
juju_cos_41fa8883_grafana_metrics-endpoint_20.rules
juju_cos_41fa8883_loki_metrics-endpoint_19.rules
juju_cos_41fa8883_traefik_metrics-endpoint_17.rules
[Both the alert rules from grafana-agent for host metrics and from ceph-mon for the Ceph service are visible now, with separate file names.]
Bug Description
Alert rules from different sources can overwrite each other. The root cause is that sets of alert rules from different sources can have the same identifier. When this happens, it results in only one set of alert rules being written to prometheus config, as rules processed later overwrite those processed earlier (whether on disk, or in the dictionary-building iteration logic).
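The overwrite can be shown with a minimal standalone sketch (illustrative only, not the charm code): when the derived identifier omits the relation, two rule sources collapse onto one dictionary key, and whichever is processed later wins.

```python
# Standalone illustration of the bug: identifiers derived only from
# (model, uuid-prefix, app) collide, so the ruleset processed later
# silently replaces the earlier one.
def collect_rules(sources):
    alerts = {}
    for src in sources:
        identifier = f"{src['model']}_{src['uuid']}_{src['app']}"  # no relation suffix
        alerts[identifier] = src["rules"]  # later source overwrites earlier one
    return alerts

sources = [
    {"model": "ceph", "uuid": "57cd5e92", "app": "ceph-mon",
     "rules": ["ceph-specific rules via prometheus-scrape-config"]},
    {"model": "ceph", "uuid": "57cd5e92", "app": "ceph-mon",
     "rules": ["host rules via grafana-agent"]},
]
print(collect_rules(sources))
# {'ceph_57cd5e92_ceph-mon': ['host rules via grafana-agent']}
```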
We have seen this in an environment where we have relations that look like this:
In this case, the host rules from grafana-agent on the ceph-mon host received the same identifier as the rules from ceph-mon delivered through prometheus-scrape-config. The identifier was `<juju-model-name>_<hash>_ceph-mon` for both, because the identifier derivation function appears to be based on the Juju model and application, which were the same for both rule sources. So the filenames for both rulesets were the same: the metrics-endpoint rules were written to file first, followed by the receive-remote-write rules, which overwrote the previous ones.

To Reproduce
Environment
Relevant log output
Additional context
NA