canonical / prometheus-k8s-operator

This charmed operator automates the operational procedures of running Prometheus, an open-source metrics backend.
https://charmhub.io/prometheus-k8s
Apache License 2.0

Alert rules from different sources can overwrite each other #551

Closed samuelallan72 closed 9 months ago

samuelallan72 commented 10 months ago

Bug Description

Alert rules from different sources can overwrite each other. The root cause is that sets of alert rules from different sources can have the same identifier. When this happens, only one set of alert rules is written to the Prometheus config, because rules processed later overwrite those processed earlier (whether on disk, or in the dictionary-building iteration logic).

We have seen this in an environment where we have relations that look like this:

In this case, the host rules from grafana-agent on the ceph-mon host received the same identifier as the rules from ceph-mon through prometheus-scrape-config. The identifier was <juju-model-name>_<hash>_ceph-mon in both cases, because the identifier derivation function seems to be based on the juju model and application, which were the same for both rule sources. As a result, the filenames for both rulesets were the same: the metrics endpoint rules were written to file first, followed by the receive-remote-write rules, which overwrote the previous ones.
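To illustrate the overwrite described above, here is a minimal sketch of the dictionary-building behaviour. It is not the charm's actual code: build_alerts and the rule group names are hypothetical, and the identifier value is only an example of the <juju-model-name>_<hash>_ceph-mon shape.

```python
from typing import Dict, List, Tuple


def build_alerts(sources: List[Tuple[str, dict]]) -> Dict[str, dict]:
    """Collect rule sets keyed by identifier (hypothetical helper).

    A later rule set with the same identifier silently replaces the earlier one.
    """
    alerts: Dict[str, dict] = {}
    for identifier, rules in sources:
        alerts[identifier] = rules
    return alerts


# Two rule sources that resolve to the same identifier (example values).
sources = [
    ("ceph_57cd5e92_ceph-mon", {"groups": [{"name": "ceph_service_alerts"}]}),
    ("ceph_57cd5e92_ceph-mon", {"groups": [{"name": "host_alerts"}]}),
]

alerts = build_alerts(sources)
surviving = alerts["ceph_57cd5e92_ceph-mon"]["groups"][0]["name"]
print(len(alerts), surviving)
```

Only one entry survives, and it is whichever rule set was processed last. The same thing happens on disk when the identifier is used as the file name.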

To Reproduce

Environment

Relevant log output

NA

Additional context

NA

nobuto-m commented 9 months ago

Just to give a complete picture of the relation data and the generated prometheus configs.

prometheus-show-unit.yaml.gz

prometheus_conf.tar.gz

dstathis commented 9 months ago

Hello,

Just to clarify, is ceph-mon a machine or k8s charm?

I see how this doesn't work. Prometheus was not designed to relate to the same application twice. I'm not sure I understand why you need grafana-agent and prometheus-scrape. I think you can use just grafana-agent to get the same metrics.

samuelallan72 commented 9 months ago

@dstathis

Just to clarify, is ceph-mon a machine or k8s charm?

A machine charm.

I think you can use just grafana-agent to get the same metrics.

We need both: grafana-agent for the host metrics, and ceph-mon for Ceph-specific metrics. Also, ceph-mon provides alert rules for prometheus which are required as well.

Also please note that it's not related to the same application twice - one is grafana-agent, the other is ceph-mon (or in this specific case, prometheus-scrape-config).

nobuto-m commented 9 months ago

I'm not an expert on this, but it looks like there would be an easy way to distinguish those two relations: add more suffixes to the file name, such as relation-id, related-endpoint, or related-units, so that it is unique.

  - relation-id: 30
    endpoint: metrics-endpoint
    related-endpoint: metrics-endpoint
    application-data:
      alert_rules:

...

    related-units:
      prometheus-scrape-config-ceph/0:
        in-scope: true
        data:
          egress-subnets: 10.152.183.146/32
          ingress-address: 10.152.183.146
          private-address: 10.152.183.146
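A rough sketch of that idea, assuming the relation endpoint and relation id are appended to the base identifier. unique_identifier is a hypothetical helper, the base identifier is an example value, and the second relation (receive-remote-write with id 31) is made up for illustration; only relation-id 30 on metrics-endpoint comes from the relation data above.

```python
def unique_identifier(base: str, endpoint: str, relation_id: int) -> str:
    """Append endpoint name and relation id so each relation gets its own key."""
    return f"{base}_{endpoint}_{relation_id}"


base = "ceph_57cd5e92_ceph-mon"  # example base identifier

# relation-id 30 / metrics-endpoint is taken from the relation data above.
first = unique_identifier(base, "metrics-endpoint", 30)
# Hypothetical second relation reaching the same application.
second = unique_identifier(base, "receive-remote-write", 31)

print(first)
print(second)
```

Because a relation id is unique within a model, two relations that end up pointing at the same application would no longer share a file name.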

nobuto-m commented 9 months ago

The context - juju status

[cos model]

Model  Controller       Cloud/Region            Version  SLA          Timestamp
cos    maas-controller  cos-microk8s/localhost  3.1.6    unsupported  12:23:32Z

App                            Version  Status  Scale  Charm                         Channel      Rev  Address         Exposed  Message
alertmanager                   0.25.0   active      1  alertmanager-k8s              edge          96  10.152.183.159  no       
catalogue                               active      1  catalogue-k8s                 edge          31  10.152.183.220  no       
cos-configuration-ceph         3.5.0    active      1  cos-configuration-k8s         latest/edge   39  10.152.183.129  no       
grafana                        9.2.1    active      1  grafana-k8s                   edge          93  10.152.183.132  no       
loki                           2.7.4    active      1  loki-k8s                      edge         104  10.152.183.143  no       
prometheus                     2.47.2   active      1  prometheus-k8s                edge         156  10.152.183.140  no       
prometheus-scrape-config-ceph  n/a      active      1  prometheus-scrape-config-k8s  latest/edge   44  10.152.183.146  no       
traefik                        2.10.4   active      1  traefik-k8s                   edge         164  192.168.151.81  no       

Unit                              Workload  Agent  Address     Ports  Message
alertmanager/0*                   active    idle   10.1.54.84         
catalogue/0*                      active    idle   10.1.54.72         
cos-configuration-ceph/0*         active    idle   10.1.54.74         
grafana/0*                        active    idle   10.1.54.86         
loki/0*                           active    idle   10.1.54.87         
prometheus-scrape-config-ceph/0*  active    idle   10.1.54.80         
prometheus/0*                     active    idle   10.1.54.88         
traefik/0*                        active    idle   10.1.54.85         

Offer                            Application                    Charm                         Rev  Connected  Endpoint                  Interface                Role
alertmanager-karma-dashboard     alertmanager                   alertmanager-k8s              96   0/0        karma-dashboard           karma_dashboard          provider
grafana-dashboards               grafana                        grafana-k8s                   93   3/3        grafana-dashboard         grafana_dashboard        requirer
loki-logging                     loki                           loki-k8s                      104  2/2        logging                   loki_push_api            provider
prometheus-receive-remote-write  prometheus                     prometheus-k8s                156  3/3        receive-remote-write      prometheus_remote_write  provider
prometheus-scrape                prometheus                     prometheus-k8s                156  0/0        metrics-endpoint          prometheus_scrape        requirer
prometheus-scrape-config-ceph    prometheus-scrape-config-ceph  prometheus-scrape-config-k8s  44   1/1        configurable-scrape-jobs  prometheus_scrape        requirer

Integration provider                            Requirer                         Interface                  Type     Message
alertmanager:alerting                           loki:alertmanager                alertmanager_dispatch      regular  
alertmanager:alerting                           prometheus:alertmanager          alertmanager_dispatch      regular  
alertmanager:grafana-dashboard                  grafana:grafana-dashboard        grafana_dashboard          regular  
alertmanager:grafana-source                     grafana:grafana-source           grafana_datasource         regular  
alertmanager:replicas                           alertmanager:replicas            alertmanager_replica       peer     
alertmanager:self-metrics-endpoint              prometheus:metrics-endpoint      prometheus_scrape          regular  
catalogue:catalogue                             alertmanager:catalogue           catalogue                  regular  
catalogue:catalogue                             grafana:catalogue                catalogue                  regular  
catalogue:catalogue                             prometheus:catalogue             catalogue                  regular  
catalogue:replicas                              catalogue:replicas               catalogue_replica          peer     
cos-configuration-ceph:grafana-dashboards       grafana:grafana-dashboard        grafana_dashboard          regular  
cos-configuration-ceph:prometheus-config        prometheus:metrics-endpoint      prometheus_scrape          regular  
cos-configuration-ceph:replicas                 cos-configuration-ceph:replicas  cos_configuration_replica  peer     
grafana:grafana                                 grafana:grafana                  grafana_peers              peer     
grafana:metrics-endpoint                        prometheus:metrics-endpoint      prometheus_scrape          regular  
grafana:replicas                                grafana:replicas                 grafana_replicas           peer     
loki:grafana-dashboard                          grafana:grafana-dashboard        grafana_dashboard          regular  
loki:grafana-source                             grafana:grafana-source           grafana_datasource         regular  
loki:metrics-endpoint                           prometheus:metrics-endpoint      prometheus_scrape          regular  
loki:replicas                                   loki:replicas                    loki_replica               peer     
prometheus-scrape-config-ceph:metrics-endpoint  prometheus:metrics-endpoint      prometheus_scrape          regular  
prometheus:grafana-dashboard                    grafana:grafana-dashboard        grafana_dashboard          regular  
prometheus:grafana-source                       grafana:grafana-source           grafana_datasource         regular  
prometheus:prometheus-peers                     prometheus:prometheus-peers      prometheus_peers           peer     
traefik:ingress                                 alertmanager:ingress             ingress                    regular  
traefik:ingress                                 catalogue:ingress                ingress                    regular  
traefik:ingress-per-unit                        loki:ingress                     ingress_per_unit           regular  
traefik:ingress-per-unit                        prometheus:ingress               ingress_per_unit           regular  
traefik:metrics-endpoint                        prometheus:metrics-endpoint      prometheus_scrape          regular  
traefik:peers                                   traefik:peers                    traefik_peers              peer     
traefik:traefik-route                           grafana:ingress                  traefik_route              regular  

[ceph model]

Model  Controller       Cloud/Region  Version  SLA          Timestamp
ceph   maas-controller  maas/default  3.1.6    unsupported  12:24:01Z

SAAS                                 Status  Store            URL
cos-alertmanager-karma-dashboard     active  maas-controller  admin/cos.alertmanager-karma-dashboard
cos-grafana-dashboards               active  maas-controller  admin/cos.grafana-dashboards
cos-loki-logging                     active  maas-controller  admin/cos.loki-logging
cos-prometheus-receive-remote-write  active  maas-controller  admin/cos.prometheus-receive-remote-write
cos-prometheus-scrape                active  maas-controller  admin/cos.prometheus-scrape
cos-prometheus-scrape-config-ceph    active  maas-controller  admin/cos.prometheus-scrape-config-ceph

App                Version  Status  Scale  Charm                   Channel        Rev  Exposed  Message
ceph-dashboard              active      3  ceph-dashboard          quincy/stable   48  no       Unit is ready
ceph-fs            17.2.6   active      1  ceph-fs                 quincy/stable   60  no       Unit is ready
ceph-iscsi                  active      2  ceph-iscsi              quincy/stable   27  no       Unit is ready
ceph-loadbalancer           active      1  openstack-loadbalancer  jammy/stable    10  no       Unit is ready
ceph-mon           17.2.6   active      3  ceph-mon                quincy/stable  194  no       Unit is ready and clustered
ceph-nfs                    active      1  ceph-nfs                quincy/stable    8  no       Unit is ready
ceph-osd           17.2.6   active      3  ceph-osd                quincy/stable  576  no       Unit is ready (2 OSD)
ceph-radosgw       17.2.6   active      1  ceph-radosgw            quincy/stable  564  no       Unit is ready
grafana-agent               active     11  grafana-agent           latest/edge     20  no       
ntp                4.2      active      3  ntp                     stable          50  no       chrony: Ready
vault              1.8.8    active      1  vault                   1.8/stable     183  no       Unit is ready (active: true, mlock: disabled)

Unit                  Workload  Agent  Machine  Public address   Ports     Message
ceph-fs/0*            active    idle   0/lxd/0  192.168.151.108            Unit is ready
  grafana-agent/12    active    idle            192.168.151.108            
ceph-iscsi/0*         active    idle   1        192.168.151.103            Unit is ready
ceph-iscsi/1          active    idle   2        192.168.151.104            Unit is ready
ceph-loadbalancer/0*  active    idle   1/lxd/0  192.168.151.107            Unit is ready
  grafana-agent/14    active    idle            192.168.151.107            
ceph-mon/0            active    idle   0/lxd/1  192.168.151.112            Unit is ready and clustered
  ceph-dashboard/2    active    idle            192.168.151.112            Unit is ready
  grafana-agent/13    active    idle            192.168.151.112            
ceph-mon/1            active    idle   1/lxd/1  192.168.151.110            Unit is ready and clustered
  ceph-dashboard/0*   active    idle            192.168.151.110            Unit is ready
  grafana-agent/9     active    idle            192.168.151.110            
ceph-mon/2*           active    idle   2/lxd/0  192.168.151.109            Unit is ready and clustered
  ceph-dashboard/1    active    idle            192.168.151.109            Unit is ready
  grafana-agent/7     active    idle            192.168.151.109            
ceph-nfs/0*           active    idle   1/lxd/2  192.168.151.111            Unit is ready
  grafana-agent/15    active    idle            192.168.151.111            
ceph-osd/0*           active    idle   0        192.168.151.102            Unit is ready (2 OSD)
  grafana-agent/2*    active    idle            192.168.151.102            
  ntp/1               active    idle            192.168.151.102  123/udp   chrony: Ready
ceph-osd/1            active    idle   1        192.168.151.103            Unit is ready (2 OSD)
  grafana-agent/1     active    idle            192.168.151.103            
  ntp/2               active    idle            192.168.151.103  123/udp   chrony: Ready
ceph-osd/2            active    idle   2        192.168.151.104            Unit is ready (2 OSD)
  grafana-agent/6     active    idle            192.168.151.104            
  ntp/0*              active    idle            192.168.151.104  123/udp   chrony: Ready
ceph-radosgw/0*       active    idle   0/lxd/2  192.168.151.113  80/tcp    Unit is ready
  grafana-agent/8     active    idle            192.168.151.113            
vault/0*              active    idle   2/lxd/1  192.168.151.106  8200/tcp  Unit is ready (active: true, mlock: disabled)
  grafana-agent/16    active    idle            192.168.151.106            

Machine  State    Address          Inst id              Base          AZ       Message
0        started  192.168.151.102  wise-fowl            ubuntu@22.04  default  Deployed
0/lxd/0  started  192.168.151.108  juju-194608-0-lxd-0  ubuntu@22.04  default  Container started
0/lxd/1  started  192.168.151.112  juju-194608-0-lxd-1  ubuntu@22.04  default  Container started
0/lxd/2  started  192.168.151.113  juju-194608-0-lxd-2  ubuntu@22.04  default  Container started
1        started  192.168.151.103  key-orca             ubuntu@22.04  default  Deployed
1/lxd/0  started  192.168.151.107  juju-194608-1-lxd-0  ubuntu@22.04  default  Container started
1/lxd/1  started  192.168.151.110  juju-194608-1-lxd-1  ubuntu@22.04  default  Container started
1/lxd/2  started  192.168.151.111  juju-194608-1-lxd-2  ubuntu@22.04  default  Container started
2        started  192.168.151.104  stable-roughy        ubuntu@22.04  default  Deployed
2/lxd/0  started  192.168.151.109  juju-194608-2-lxd-0  ubuntu@22.04  default  Container started
2/lxd/1  started  192.168.151.106  juju-194608-2-lxd-1  ubuntu@22.04  default  Container started

Integration provider                                      Requirer                                                    Interface                Type         Message
ceph-fs:juju-info                                         grafana-agent:juju-info                                     juju-info                subordinate  
ceph-iscsi:admin-access                                   ceph-dashboard:iscsi-dashboard                              ceph-iscsi-admin-access  regular      
ceph-iscsi:cluster                                        ceph-iscsi:cluster                                          ceph-iscsi-peer          peer         
ceph-loadbalancer:juju-info                               grafana-agent:juju-info                                     juju-info                subordinate  
ceph-loadbalancer:loadbalancer                            ceph-dashboard:loadbalancer                                 openstack-loadbalancer   regular      
ceph-mon:client                                           ceph-iscsi:ceph-client                                      ceph-client              regular      
ceph-mon:client                                           ceph-nfs:ceph-client                                        ceph-client              regular      
ceph-mon:dashboard                                        ceph-dashboard:dashboard                                    ceph-dashboard           subordinate  
ceph-mon:juju-info                                        grafana-agent:juju-info                                     juju-info                subordinate  
ceph-mon:mds                                              ceph-fs:ceph-mds                                            ceph-mds                 regular      
ceph-mon:metrics-endpoint                                 cos-prometheus-scrape-config-ceph:configurable-scrape-jobs  prometheus_scrape        regular      
ceph-mon:mon                                              ceph-mon:mon                                                ceph                     peer         
ceph-mon:osd                                              ceph-osd:mon                                                ceph-osd                 regular      
ceph-mon:radosgw                                          ceph-radosgw:mon                                            ceph-radosgw             regular      
ceph-nfs:cluster                                          ceph-nfs:cluster                                            ceph-nfs-peer            peer         
ceph-nfs:juju-info                                        grafana-agent:juju-info                                     juju-info                subordinate  
ceph-osd:juju-info                                        grafana-agent:juju-info                                     juju-info                subordinate  
ceph-osd:juju-info                                        ntp:juju-info                                               juju-info                subordinate  
ceph-radosgw:cluster                                      ceph-radosgw:cluster                                        swift-ha                 peer         
ceph-radosgw:juju-info                                    grafana-agent:juju-info                                     juju-info                subordinate  
cos-loki-logging:logging                                  grafana-agent:logging-consumer                              loki_push_api            regular      
cos-prometheus-receive-remote-write:receive-remote-write  grafana-agent:send-remote-write                             prometheus_remote_write  regular      
grafana-agent:grafana-dashboards-provider                 cos-grafana-dashboards:grafana-dashboard                    grafana_dashboard        regular      
grafana-agent:peers                                       grafana-agent:peers                                         grafana_agent_replica    peer         
ntp:ntp-peers                                             ntp:ntp-peers                                               ntp                      peer         
vault:certificates                                        ceph-dashboard:certificates                                 tls-certificates         regular      
vault:certificates                                        ceph-iscsi:certificates                                     tls-certificates         regular      
vault:certificates                                        ceph-radosgw:certificates                                   tls-certificates         regular      
vault:cluster                                             vault:cluster                                               vault-ha                 peer         
vault:juju-info                                           grafana-agent:juju-info                                     juju-info                subordinate  

nobuto-m commented 9 months ago

prometheus-k8s-operator (unique_identifiers=)$ git rev-parse HEAD
cd1615f21c3e38c6fc56356b95330186b0de5a32

$ charmcraft pack

$ juju refresh prometheus --path ./prometheus-k8s_ubuntu-20.04-amd64.charm

I only see one file per application, although I expected to see one rule file for the grafana-agent host metrics plus another rule file for the specific service.

# ls -l /etc/prometheus/rules/       
total 280
-rw-r--r-- 1 root root  55997 Nov 22 15:31 juju_ceph_57cd5e92_ceph-mon_metrics-endpoint_30.rules
-rw-r--r-- 1 root root  26899 Nov 22 15:31 juju_ceph_57cd5e92_ceph-osd.rules
-rw-r--r-- 1 root root  27715 Nov 22 15:31 juju_controller_1f525a2f_controller.rules
-rw-r--r-- 1 root root 151705 Nov 22 15:31 juju_cos-microk8s_c4865096_microk8s.rules
-rw-r--r-- 1 root root   2552 Nov 22 15:31 juju_cos_344a17da_alertmanager_metrics-endpoint_18.rules
-rw-r--r-- 1 root root   1416 Nov 22 15:31 juju_cos_344a17da_grafana_metrics-endpoint_20.rules
-rw-r--r-- 1 root root   3135 Nov 22 15:31 juju_cos_344a17da_loki_metrics-endpoint_19.rules
-rw-r--r-- 1 root root   1327 Nov 22 15:31 juju_cos_344a17da_traefik_metrics-endpoint_17.rules

dstathis commented 9 months ago

Can I see juju status --relations? I am a bit confused how this could have not worked.

dstathis commented 9 months ago

Could you also post juju debug-log?

nobuto-m commented 9 months ago

Can I see juju status --relations? I am a bit confused how this could have not worked.

Juju status is in https://github.com/canonical/prometheus-k8s-operator/issues/551#issuecomment-1822676148

nobuto-m commented 9 months ago

I'm confused. juju_ceph_57cd5e92_ceph-osd.rules doesn't follow the new naming scheme, identifier = f"{identifier}_{relation.name}_{relation.id}".

Are alert_rules from grafana-agent handled in a different place of the code?

I did some quick verbose logging and it didn't capture anything around ceph-osd.

$ git diff
diff --git a/lib/charms/prometheus_k8s/v0/prometheus_scrape.py b/lib/charms/prometheus_k8s/v0/prometheus_scrape.py
index 6392737..1530f73 100644
--- a/lib/charms/prometheus_k8s/v0/prometheus_scrape.py
+++ b/lib/charms/prometheus_k8s/v0/prometheus_scrape.py
@@ -1000,16 +1000,21 @@ class MetricsEndpointConsumer(Object):
         """
         alerts = {}  # type: Dict[str, dict] # mapping b/w juju identifiers and alert rule files
         for relation in self._charm.model.relations[self._relation_name]:
+            logger.warn(f"{relation.units = }")
+            logger.warn(f"{relation.app = }")
             if not relation.units or not relation.app:
                 continue

             alert_rules = json.loads(relation.data[relation.app].get("alert_rules", "{}"))
             if not alert_rules:
+                logger.warn("not alert_rules")
                 continue

             alert_rules = self._inject_alert_expr_labels(alert_rules)

             identifier, topology = self._get_identifier_by_alert_rules(alert_rules)
+            logger.warn(f"{identifier = }")
+            logger.warn(f"{topology = }")
             if not topology:
                 try:
                     scrape_metadata = json.loads(relation.data[relation.app]["scrape_metadata"])
@@ -1032,13 +1037,16 @@ class MetricsEndpointConsumer(Object):
             # We need to append the relation info to the identifier. This is to allow for cases for there are two
             # relations which eventually scrape the same application. Issue #551.
             identifier = f"{identifier}_{relation.name}_{relation.id}"
+            logger.warn(f"{identifier = }")

             alerts[identifier] = alert_rules

             _, errmsg = self._tool.validate_alert_rules(alert_rules)
+            logger.warn(f"{errmsg = }")
             if errmsg:
                 if alerts[identifier]:
                     del alerts[identifier]
+                    logger.warn("del alerts[identifier]")
                 if self._charm.unit.is_leader():
                     data = json.loads(relation.data[self._charm.app].get("event", "{}"))
                     data["errors"] = errmsg

unit-prometheus-0: 15:48:55 INFO unit.prometheus/0.juju-log Pushed new configuration
unit-prometheus-0: 15:48:55 WARNING unit.prometheus/0.juju-log relation.units = {<ops.model.Unit traefik/0>}
unit-prometheus-0: 15:48:55 WARNING unit.prometheus/0.juju-log relation.app = <ops.model.Application traefik>
unit-prometheus-0: 15:48:55 WARNING unit.prometheus/0.juju-log identifier = 'cos_344a17da_traefik'
unit-prometheus-0: 15:48:55 WARNING unit.prometheus/0.juju-log topology = <cosl.juju_topology.JujuTopology object at 0x7f389b5c3610>
unit-prometheus-0: 15:48:55 WARNING unit.prometheus/0.juju-log identifier = 'cos_344a17da_traefik_metrics-endpoint_17'
unit-prometheus-0: 15:48:55 WARNING unit.prometheus/0.juju-log errmsg = ''
unit-prometheus-0: 15:48:55 WARNING unit.prometheus/0.juju-log relation.units = {<ops.model.Unit alertmanager/0>}
unit-prometheus-0: 15:48:55 WARNING unit.prometheus/0.juju-log relation.app = <ops.model.Application alertmanager>
unit-prometheus-0: 15:48:55 WARNING unit.prometheus/0.juju-log identifier = 'cos_344a17da_alertmanager'
unit-prometheus-0: 15:48:55 WARNING unit.prometheus/0.juju-log topology = <cosl.juju_topology.JujuTopology object at 0x7f389b5c3580>
unit-prometheus-0: 15:48:55 WARNING unit.prometheus/0.juju-log identifier = 'cos_344a17da_alertmanager_metrics-endpoint_18'
unit-prometheus-0: 15:48:55 WARNING unit.prometheus/0.juju-log errmsg = ''
unit-prometheus-0: 15:48:55 WARNING unit.prometheus/0.juju-log relation.units = {<ops.model.Unit loki/0>}
unit-prometheus-0: 15:48:55 WARNING unit.prometheus/0.juju-log relation.app = <ops.model.Application loki>
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log identifier = 'cos_344a17da_loki'
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log topology = <cosl.juju_topology.JujuTopology object at 0x7f389b5c3e20>
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log identifier = 'cos_344a17da_loki_metrics-endpoint_19'
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log errmsg = ''
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log relation.units = {<ops.model.Unit grafana/0>}
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log relation.app = <ops.model.Application grafana>
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log identifier = 'cos_344a17da_grafana'
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log topology = <cosl.juju_topology.JujuTopology object at 0x7f389b5c3cd0>
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log identifier = 'cos_344a17da_grafana_metrics-endpoint_20'
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log errmsg = ''
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log relation.units = {<ops.model.Unit cos-configuration-ceph/0>}
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log relation.app = <ops.model.Application cos-configuration-ceph>
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log not alert_rules
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log relation.units = {<ops.model.Unit prometheus-scrape-config-ceph/0>}
unit-prometheus-0: 15:48:56 WARNING unit.prometheus/0.juju-log relation.app = <ops.model.Application prometheus-scrape-config-ceph>
unit-prometheus-0: 15:48:57 WARNING unit.prometheus/0.juju-log identifier = 'ceph_57cd5e92_ceph-mon'
unit-prometheus-0: 15:48:57 WARNING unit.prometheus/0.juju-log topology = <cosl.juju_topology.JujuTopology object at 0x7f389b5c3340>
unit-prometheus-0: 15:48:57 WARNING unit.prometheus/0.juju-log identifier = 'ceph_57cd5e92_ceph-mon_metrics-endpoint_30'
unit-prometheus-0: 15:48:57 WARNING unit.prometheus/0.juju-log errmsg = ''
unit-prometheus-0: 15:49:02 DEBUG unit.prometheus/0.juju-log Updated alert rules file juju_cos_344a17da_traefik_metrics-endpoint_17.rules
unit-prometheus-0: 15:49:02 DEBUG unit.prometheus/0.juju-log Updated alert rules file juju_cos_344a17da_alertmanager_metrics-endpoint_18.rules
unit-prometheus-0: 15:49:02 DEBUG unit.prometheus/0.juju-log Updated alert rules file juju_cos_344a17da_loki_metrics-endpoint_19.rules
unit-prometheus-0: 15:49:02 DEBUG unit.prometheus/0.juju-log Updated alert rules file juju_cos_344a17da_grafana_metrics-endpoint_20.rules
unit-prometheus-0: 15:49:02 DEBUG unit.prometheus/0.juju-log Updated alert rules file juju_ceph_57cd5e92_ceph-mon_metrics-endpoint_30.rules
unit-prometheus-0: 15:49:02 DEBUG unit.prometheus/0.juju-log Updated alert rules file juju_controller_1f525a2f_controller.rules
unit-prometheus-0: 15:49:02 DEBUG unit.prometheus/0.juju-log Updated alert rules file juju_ceph_57cd5e92_ceph-osd.rules
unit-prometheus-0: 15:49:02 DEBUG unit.prometheus/0.juju-log Updated alert rules file juju_cos-microk8s_c4865096_microk8s.rules
unit-prometheus-0: 15:49:02 DEBUG unit.prometheus/0.juju-log Building pebble layer

nobuto-m commented 9 months ago

prometheus-0_debug.log

nobuto-m commented 9 months ago

Why alert_rules from grafana-agent were created only for ceph-osd can be explained by https://github.com/canonical/grafana-agent-operator/issues/17, so it's a separate issue.

ceph-osd/0*           active    idle   0        192.168.151.102            Unit is ready (2 OSD)
  grafana-agent/2*    active    idle            192.168.151.102            

nobuto-m commented 9 months ago

tl;dr: all good with the proposed patch at commit cd1615f21c3e38c6fc56356b95330186b0de5a32

Okay, the testing of the proposed patch was sidetracked by https://github.com/canonical/grafana-agent-operator/issues/17, so I redeployed my test bed and now have the following, with a dedicated grafana-agent-ceph-mon application, to properly reproduce the original issue.

Unit                         Workload  Agent  Machine  Public address   Ports  Message
ceph-mon/0                   active    idle   0/lxd/1  192.168.151.109         Unit is ready and clustered
  ceph-dashboard/2           active    idle            192.168.151.109         Unit is ready
  grafana-agent-ceph-mon/1   active    idle            192.168.151.109         
ceph-mon/1                   active    idle   1/lxd/1  192.168.151.111         Unit is ready and clustered
  ceph-dashboard/1           active    idle            192.168.151.111         Unit is ready
  grafana-agent-ceph-mon/2   active    idle            192.168.151.111         
ceph-mon/2*                  active    idle   2/lxd/0  192.168.151.108         Unit is ready and clustered
  ceph-dashboard/0*          active    idle            192.168.151.108         Unit is ready
  grafana-agent-ceph-mon/0*  active    idle            192.168.151.108         

[before patching]

root@prometheus-0:/# ls -1 /etc/prometheus/rules/
juju_ceph_456c92bf_ceph-mon.rules
juju_ceph_456c92bf_ceph-osd.rules
juju_controller_ef2e4e1a_controller.rules
juju_cos-microk8s_59d81459_microk8s.rules
juju_cos_41fa8883_alertmanager.rules
juju_cos_41fa8883_grafana.rules
juju_cos_41fa8883_loki.rules
juju_cos_41fa8883_traefik.rules

[after patching]

root@prometheus-0:/# ls -1 /etc/prometheus/rules/
juju_ceph_456c92bf_ceph-mon.rules
juju_ceph_456c92bf_ceph-mon_metrics-endpoint_30.rules
juju_ceph_456c92bf_ceph-osd.rules
juju_controller_ef2e4e1a_controller.rules
juju_cos-microk8s_59d81459_microk8s.rules
juju_cos_41fa8883_alertmanager_metrics-endpoint_18.rules
juju_cos_41fa8883_grafana_metrics-endpoint_20.rules
juju_cos_41fa8883_loki_metrics-endpoint_19.rules
juju_cos_41fa8883_traefik_metrics-endpoint_17.rules

[both alert rules from grafana-agent for host metrics and ceph-mon for Ceph service are visible now with separate file names]
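For anyone comparing listings like the ones above, here is a small sketch that tells the two naming schemes apart. The regular expression is an assumption inferred from the file names in this thread (juju_<identifier>_<endpoint>_<relation id>.rules), and split_rule_file is a hypothetical helper, not part of the charm.

```python
import re
from typing import Optional, Tuple

# Assumed shape of a suffixed rule file name; endpoint names contain no underscores.
SUFFIXED = re.compile(
    r"^juju_(?P<identifier>.+)_(?P<endpoint>[a-z][a-z0-9-]*)_(?P<relation_id>\d+)\.rules$"
)


def split_rule_file(name: str) -> Optional[Tuple[str, str, int]]:
    """Return (identifier, endpoint, relation id), or None if there is no suffix."""
    match = SUFFIXED.match(name)
    if match is None:
        return None
    return match["identifier"], match["endpoint"], int(match["relation_id"])


# File names copied from the "after patching" listing.
patched = split_rule_file("juju_ceph_456c92bf_ceph-mon_metrics-endpoint_30.rules")
unpatched = split_rule_file("juju_ceph_456c92bf_ceph-mon.rules")

print(patched)
print(unpatched)
```

In the listing above, the file without a suffix is the one that did not get the relation name and id appended, which is why the two ceph-mon rule sets no longer collide.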