canonical / prometheus-k8s-operator

This charmed operator automates the operational procedures of running Prometheus, an open-source metrics backend.
https://charmhub.io/prometheus-k8s
Apache License 2.0

Alerts gone after pod reschedule #609

Closed. sed-i closed this issue 3 months ago

sed-i commented 3 months ago

Bug Description

While testing https://github.com/canonical/cos-proxy-operator/pull/136, the Prometheus pod got rescheduled, and all alerts were gone.

juju show-unit has alerts:

prom/0:
  workload-version: 2.50.1
  opened-ports: []
  charm: ch:amd64/focal/prometheus-k8s-189
  leader: true
  life: alive
  relation-info:
  - relation-id: 6
    endpoint: metrics-endpoint
    cross-model: true
    related-endpoint: downstream-prometheus-scrape
    application-data:
      alert_rules: '{"groups": [{"name": "juju_luca_30d8d8d_ubuntu_0_alert_rules",
        "rules": [{"alert": "CheckConntrackNrpeAlert", "expr": "avg_over_time(command_status{juju_unit=''ubuntu/0'',command=''check_conntrack''}[15m])
        > 1 or (absent_over_time(command_status{juju_unit=''ubuntu/0'',command=''check_conntrack''}[10m])
# ...

and the alerts are indeed on disk:

❯ jsshc prometheus prom/0 cat /etc/prometheus/rules/juju_juju_luca_30d8d8d_ubuntu_0_alert_rules_metrics-endpoint_6.rules
groups:
- name: juju_luca_30d8d8d_ubuntu_0_alert_rules
  rules:
  - alert: CheckConntrackNrpeAlert
    annotations:
      description: "Check provided by nrpe_exporter in model {{ $labels.juju_model\
        \ }} is failing.\nFailing check = {{ $labels.command }}\nUnit = {{ $labels.juju_unit\
# ...

but the web UI shows nothing:

❯ curl 10.1.166.113:9090/api/v1/alerts
{"status":"success","data":{"alerts":[]}}

kubectl logs seem fine:

2024-05-28T13:30:04.558Z [prometheus] ts=2024-05-28T13:30:04.558Z caller=main.go:1324 level=info msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
2024-05-28T13:30:04.559Z [prometheus] ts=2024-05-28T13:30:04.559Z caller=main.go:1361 level=info msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml totalDuration=1.795905ms db_storage=780ns remote_storage=670ns web_handler=160ns query_engine=520ns scrape=155.801µs scrape_sd=38.728µs notify=1.13µs notify_sd=510ns rules=1.142212ms tracing=5.54µs
2024-05-28T13:30:04.559Z [prometheus] ts=2024-05-28T13:30:04.559Z caller=main.go:1103 level=info msg="Server is ready to receive web requests."

To Reproduce

Deploy this bundle in a k8s model:

bundle: kubernetes
saas:
  remote-2c369e446bcc43d58ca9ab95eb920c5c: {}
applications:
  prom:
    charm: prometheus-k8s
    channel: latest/edge
    revision: 189
    resources:
      prometheus-image: 144
    scale: 1
    trust: true
--- # overlay.yaml
applications:
  prom:
    offers:
      prom:
        endpoints:
        - metrics-endpoint
        - receive-remote-write
        acl:
          admin: admin

And this bundle in an LXD model:

series: jammy
saas:
  prom:
    url: j34:admin/pg.prom
applications:
  cos-proxy:
    charm: cos-proxy
    channel: latest/edge
    num_units: 1
  nrpe:
    charm: nrpe
    channel: latest/edge
    revision: 122
  ubuntu:
    charm: ubuntu
    channel: stable
    revision: 24
    num_units: 1
relations:
- - ubuntu:juju-info
  - nrpe:general-info
- - cos-proxy:monitors
  - nrpe:monitors
- - cos-proxy:downstream-prometheus-scrape
  - prom:metrics-endpoint

Then refresh cos-proxy.

Environment

Controller  Model  User   Access     Cloud/Region         Models  Nodes    HA  Version
j34         pg     admin  superuser  microk8s/localhost        2      1     -  3.4.2  
lxd2*       luca   admin  superuser  localhost/localhost       2      1  none  3.1.7  

Relevant log output

❯ jssl prom/0 --days=11
Time                        Type       Status       Message
28 May 2024 09:12:52-04:00  juju-unit  allocating   
28 May 2024 09:12:52-04:00  workload   waiting      installing agent
28 May 2024 09:13:17-04:00  workload   waiting      agent initialising
28 May 2024 09:13:50-04:00  workload   maintenance  installing charm software
28 May 2024 09:13:50-04:00  juju-unit  executing    running install hook
28 May 2024 09:13:54-04:00  juju-unit  executing    running prometheus-peers-relation-created hook
28 May 2024 09:13:56-04:00  juju-unit  executing    running leader-elected hook
28 May 2024 09:13:59-04:00  workload   active       
28 May 2024 09:14:00-04:00  juju-unit  executing    running prometheus-pebble-ready hook
28 May 2024 09:14:03-04:00  juju-unit  executing    running database-storage-attached hook
28 May 2024 09:14:05-04:00  juju-unit  executing    running config-changed hook
28 May 2024 09:14:09-04:00  workload   waiting      Waiting for resource limit patch to apply
28 May 2024 09:14:59-04:00  juju-unit  error        hook failed: "start"
28 May 2024 09:15:04-04:00  juju-unit  executing    running start hook
28 May 2024 09:15:08-04:00  juju-unit  executing    running prometheus-pebble-ready hook
28 May 2024 09:15:21-04:00  juju-unit  idle         
28 May 2024 09:15:51-04:00  juju-unit  executing    running metrics-endpoint-relation-created hook
28 May 2024 09:15:54-04:00  juju-unit  executing    running metrics-endpoint-relation-changed hook
28 May 2024 09:15:56-04:00  juju-unit  executing    running metrics-endpoint-relation-joined hook for remote-2c369e446bcc43d58ca9ab95eb920c5c/0
28 May 2024 09:15:58-04:00  juju-unit  executing    running metrics-endpoint-relation-changed hook for remote-2c369e446bcc43d58ca9ab95eb920c5c/0
28 May 2024 09:16:01-04:00  juju-unit  executing    running metrics-endpoint-relation-changed hook
28 May 2024 09:16:03-04:00  juju-unit  idle         
28 May 2024 09:19:22-04:00  workload   active       
28 May 2024 09:22:23-04:00  workload   maintenance  stopping charm software
28 May 2024 09:22:23-04:00  juju-unit  executing    running stop hook
28 May 2024 09:22:24-04:00  workload   active       
28 May 2024 09:22:27-04:00  juju-unit  idle         
28 May 2024 09:22:27-04:00  workload   maintenance  
28 May 2024 09:30:01-04:00  juju-unit  executing    running upgrade-charm hook
28 May 2024 09:30:06-04:00  juju-unit  executing    running config-changed hook
28 May 2024 09:30:09-04:00  juju-unit  executing    running start hook
28 May 2024 09:30:11-04:00  juju-unit  executing    running prometheus-pebble-ready hook
28 May 2024 09:30:14-04:00  juju-unit  executing    running metrics-endpoint-relation-changed hook
28 May 2024 09:30:29-04:00  juju-unit  idle         
28 May 2024 23:32:26-04:00  workload   active

Additional context

No response

sed-i commented 3 months ago

Oops, was looking at api/v1/alerts instead of api/v1/rules. Closing!
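
For anyone landing here with the same symptom: in the Prometheus HTTP API, api/v1/alerts lists only active alerts (pending or firing), while api/v1/rules lists every loaded rule, including alerting rules that are currently inactive. An empty alerts list is therefore expected when the rules loaded fine and nothing is firing. Below is a minimal sketch of how the two responses could be compared. The helper names (fetch_json, count_rules_by_state, active_alert_names) are hypothetical and not part of the charm or of Prometheus; only the endpoint paths and response fields come from the Prometheus API.

```python
import json
import urllib.request


def fetch_json(base_url: str, path: str) -> dict:
    """Hypothetical helper: GET base_url + path and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}{path}", timeout=10) as resp:
        return json.load(resp)


def count_rules_by_state(rules_payload: dict) -> dict:
    """Count alerting rules per state in an api/v1/rules response body.

    Recording rules are skipped; alerting rules report a state of
    "inactive", "pending" or "firing".
    """
    counts: dict = {}
    for group in rules_payload["data"]["groups"]:
        for rule in group["rules"]:
            if rule.get("type") != "alerting":
                continue
            state = rule.get("state", "unknown")
            counts[state] = counts.get(state, 0) + 1
    return counts


def active_alert_names(alerts_payload: dict) -> list:
    """Return the alertname label of each active alert in an api/v1/alerts body."""
    return [a["labels"]["alertname"] for a in alerts_payload["data"]["alerts"]]
```

Against a live unit, something like count_rules_by_state(fetch_json("http://10.1.166.113:9090", "/api/v1/rules")) should report the loaded rules even while active_alert_names(fetch_json("http://10.1.166.113:9090", "/api/v1/alerts")) returns an empty list; the address is the one from the curl call in the report and will differ per deployment.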