canonical / prometheus-k8s-operator

This charmed operator automates the operational procedures of running Prometheus, an open-source metrics backend.
https://charmhub.io/prometheus-k8s
Apache License 2.0
21 stars 34 forks source link

Alert rules embedded in charmed operators must not have the juju_unit added to the topology #238

Closed dstathis closed 2 years ago

dstathis commented 2 years ago

Bug Description

When using the prometheus_scrape library, the unit should not be added as a label. There should only be one set of alerts which act on the whole application.

To Reproduce

Use version 0.16 of the prometheus_scrape library and make sure to include alert rules.

Environment

juju+k8s

Relevant log output

dylan@cerritos:~$ juju show-unit prometheus-k8s/0
prometheus-k8s/0:
  opened-ports: []
  charm: ch:amd64/focal/prometheus-k8s-19
  leader: true
  relation-info:
  - endpoint: prometheus-peers
    related-endpoint: prometheus-peers
    application-data: {}
    local-unit:
      in-scope: true
      data:
        egress-subnets: 10.152.183.213/32
        ingress-address: 10.152.183.213
        private-address: 10.152.183.213
  - endpoint: metrics-endpoint
    related-endpoint: metrics-endpoint
    application-data:
      alert_rules: '{"groups": [{"name": "test_b411f591-160b-46b6-826d-93c55be76b35_traefik-k8s_traefik-k8s_0_traefik-k8s_unit_unavailable_alerts",
        "rules": [{"alert": "TraefikIngressUnitIsUnavailable", "expr": "up{juju_model=\"test\",
        juju_model_uuid=\"b411f591-160b-46b6-826d-93c55be76b35\", juju_application=\"traefik-k8s\",
        juju_unit=\"traefik-k8s/0\", juju_charm=\"traefik-k8s\"} < 1", "for": "0m",
        "labels": {"severity": "critical", "juju_model": "test", "juju_model_uuid":
        "b411f591-160b-46b6-826d-93c55be76b35", "juju_application": "traefik-k8s",
        "juju_unit": "traefik-k8s/0", "juju_charm": "traefik-k8s"}, "annotations":
        {"summary": "Traefik ingress unit {{ $labels.juju_model }}/{{ $labels.juju_unit
        }} unavailable", "description": "The Traefik ingress unit {{ $labels.juju_model
        }} {{ $labels.juju_unit }} is unavailable LABELS = {{ $labels }}\n"}}]}]}'
      scrape_jobs: '[{"metrics_path": "/metrics", "static_configs": [{"targets": ["*:8082"]}]}]'
      scrape_metadata: '{"model": "test", "model_uuid": "b411f591-160b-46b6-826d-93c55be76b35",
        "application": "traefik-k8s", "unit": "traefik-k8s/0", "charm_name": "traefik-k8s"}'
    related-units:
      traefik-k8s/0:
        in-scope: true
        data:
          egress-subnets: 10.152.183.25/32
          ingress-address: 10.152.183.25
          private-address: 10.152.183.25
          prometheus_scrape_unit_address: 10.1.250.213
          prometheus_scrape_unit_name: traefik-k8s/0
  provider-id: prometheus-k8s-0
  address: 10.1.250.214

Additional context

No response

mmanciop commented 2 years ago

Note: if a charm author wants to specify logic that will be evaluated over multiple units, like average CPU usage, they will use aggregator operators.

Alert rules are evaluated per timeseries. A timeseries is identified by all its labels. The unit is already in the timeseries, unless the query uses a functionality like without.

mmanciop commented 2 years ago

Also, note that our end-to-end tests are likely using single units, so the failure of alerts to be raised for non-leader units went unnoticed.

balbirthomas commented 2 years ago

A fix has been merged into the git repository and is currently under testing.

balbirthomas commented 2 years ago

@dstathis can this ticket be closed ?