canonical / alertmanager-k8s-operator

This charmed operator automates the operational procedures of Alertmanager, the alerting component used by Prometheus, Loki, and others.
https://charmhub.io/alertmanager-k8s
Apache License 2.0

AlertmanagerNotificationsFailed is fired continuously due to integration=webhook #237

Closed: nobuto-m closed this issue 3 months ago

nobuto-m commented 3 months ago

Bug Description

The AlertmanagerNotificationsFailed alert is fired out of the box.

(Screenshots attached: Screenshot from 2024-03-18 11-27-54, Screenshot from 2024-03-18 11-27-33)

To Reproduce

  1. juju deploy cos-lite --trust --channel latest/edge
  2. relate COS with a workload (a rough sketch follows)
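
A rough sketch of step 2 using cross-model relations; every offer, endpoint, and model name below is illustrative, not taken from this environment:

# On the COS side, expose an offer for Prometheus (the cos-lite offers overlay can also do this):
juju offer cos.prometheus:metrics-endpoint prometheus-scrape
# On the workload side, consume the offer and integrate the workload with it:
juju consume -m ceph cos.prometheus-scrape
juju relate -m ceph ceph-mon:metrics-endpoint prometheus-scrape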

Environment

App                            Version  Status  Scale  Charm                         Channel      Rev
alertmanager                   0.26.0   active      1  alertmanager-k8s              edge         105
catalogue                               active      1  catalogue-k8s                 edge          33
cos-configuration-ceph         3.5.0    active      1  cos-configuration-k8s         latest/edge   47
grafana                        9.5.3    active      1  grafana-k8s                   edge         106
loki                           2.9.5    active      1  loki-k8s                      edge         125
prometheus                     2.49.1   active      1  prometheus-k8s                edge         171
prometheus-scrape-config-ceph  n/a      active      1  prometheus-scrape-config-k8s  latest/edge   47
traefik                        2.10.5   active      1  traefik-k8s                   edge         174

Relevant log output

2024-03-17T15:34:15.139Z [alertmanager] ts=2024-03-17T15:34:15.139Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=3 err="placeholder/webhook[0]: notify retry canceled after 17 attempts: Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-17T15:34:15.139Z [alertmanager] ts=2024-03-17T15:34:15.139Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="placeholder/webhook[0]: notify retry canceled after 16 attempts: Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-17T15:34:15.140Z [alertmanager] ts=2024-03-17T15:34:15.139Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=2 err="placeholder/webhook[0]: notify retry canceled after 17 attempts: Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-17T15:34:15.140Z [alertmanager] ts=2024-03-17T15:34:15.140Z caller=notify.go:745 level=warn component=dispatcher receiver=placeholder integration=webhook[0] aggrGroup="{}:{juju_application=\"microk8s\", juju_model=\"cos-microk8s\", juju_model_uuid=\"b96b05ee-afa6-46fd-8ec7-02ca7528a5d9\"}" msg="Notify attempt failed, will retry later" attempts=1 err="Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-17T15:34:15.140Z [alertmanager] ts=2024-03-17T15:34:15.140Z caller=notify.go:745 level=warn component=dispatcher receiver=placeholder integration=webhook[0] aggrGroup="{}:{juju_application=\"alertmanager\", juju_model=\"cos\", juju_model_uuid=\"4ccf0ff7-981f-45eb-86d9-4c6f0b922527\"}" msg="Notify attempt failed, will retry later" attempts=1 err="Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-17T15:34:15.140Z [alertmanager] ts=2024-03-17T15:34:15.140Z caller=notify.go:745 level=warn component=dispatcher receiver=placeholder integration=webhook[0] aggrGroup="{}:{juju_application=\"ceph-mon\", juju_model=\"ceph\", juju_model_uuid=\"a23fcb4b-992d-40bd-820b-6ac0f69db5e2\"}" msg="Notify attempt failed, will retry later" attempts=1 err="Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"

pod_alertmanager-0.log

Additional context

No response

nobuto-m commented 3 months ago

Looks like those failures occur every 5 minutes, which matches the interval of the update-status hook.

pods.log

nobuto-m commented 3 months ago

Hmm, scratch that. "dial tcp 127.0.0.1:5001: connect: connection refused" is still happening every 5 minutes, even after setting update-status-hook-interval=30m.
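
(For reference, that was set with the standard model configuration key; the model name here is an assumption:)

juju model-config -m cos update-status-hook-interval=30m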

2024-03-18T13:38:41.800801283Z stdout F 2024-03-18T13:38:41.800Z [alertmanager] ts=2024-03-18T13:38:41.800Z caller=notify.go:745 level=warn component=dispatcher receiver=placeholder integration=webhook[0] aggrGroup="{}:{juju_application=\"alertmanager\", juju_model=\"cos\", juju_model_uuid=\"4ccf0ff7-981f-45eb-86d9-4c6f0b922527\"}" msg="Notify attempt failed, will retry later" attempts=1 err="Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-18T13:38:42.215479478Z stdout F 2024-03-18T13:38:42.215Z [container-agent] 2024-03-18 13:38:42 INFO juju.worker.uniter.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
2024-03-18T13:43:41.8041126Z stdout F 2024-03-18T13:43:41.803Z [alertmanager] ts=2024-03-18T13:43:41.803Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="placeholder/webhook[0]: notify retry canceled after 16 attempts: Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-18T13:43:41.804073306Z stdout F 2024-03-18T13:43:41.803Z [alertmanager] ts=2024-03-18T13:43:41.803Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=2 err="placeholder/webhook[0]: notify retry canceled after 16 attempts: Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-18T13:43:41.804506534Z stdout F 2024-03-18T13:43:41.804Z [alertmanager] ts=2024-03-18T13:43:41.804Z caller=notify.go:745 level=warn component=dispatcher receiver=placeholder integration=webhook[0] aggrGroup="{}:{juju_application=\"alertmanager\", juju_model=\"cos\", juju_model_uuid=\"4ccf0ff7-981f-45eb-86d9-4c6f0b922527\"}" msg="Notify attempt failed, will retry later" attempts=1 err="Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-18T13:43:41.804520861Z stdout F 2024-03-18T13:43:41.804Z [alertmanager] ts=2024-03-18T13:43:41.804Z caller=notify.go:745 level=warn component=dispatcher receiver=placeholder integration=webhook[0] aggrGroup="{}:{juju_application=\"microk8s\", juju_model=\"cos-microk8s\", juju_model_uuid=\"b96b05ee-afa6-46fd-8ec7-02ca7528a5d9\"}" msg="Notify attempt failed, will retry later" attempts=1 err="Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-18T13:48:41.8047974Z stdout F 2024-03-18T13:48:41.804Z [alertmanager] ts=2024-03-18T13:48:41.804Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=2 err="placeholder/webhook[0]: notify retry canceled after 17 attempts: Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-18T13:48:41.804756764Z stdout F 2024-03-18T13:48:41.804Z [alertmanager] ts=2024-03-18T13:48:41.804Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="placeholder/webhook[0]: notify retry canceled after 16 attempts: Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-18T13:48:41.805093239Z stdout F 2024-03-18T13:48:41.805Z [alertmanager] ts=2024-03-18T13:48:41.804Z caller=notify.go:745 level=warn component=dispatcher receiver=placeholder integration=webhook[0] aggrGroup="{}:{juju_application=\"microk8s\", juju_model=\"cos-microk8s\", juju_model_uuid=\"b96b05ee-afa6-46fd-8ec7-02ca7528a5d9\"}" msg="Notify attempt failed, will retry later" attempts=1 err="Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-18T13:48:41.805115331Z stdout F 2024-03-18T13:48:41.805Z [alertmanager] ts=2024-03-18T13:48:41.804Z caller=notify.go:745 level=warn component=dispatcher receiver=placeholder integration=webhook[0] aggrGroup="{}:{juju_application=\"alertmanager\", juju_model=\"cos\", juju_model_uuid=\"4ccf0ff7-981f-45eb-86d9-4c6f0b922527\"}" msg="Notify attempt failed, will retry later" attempts=1 err="Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-18T13:53:41.8056064Z stdout F 2024-03-18T13:53:41.805Z [alertmanager] ts=2024-03-18T13:53:41.805Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=2 err="placeholder/webhook[0]: notify retry canceled after 16 attempts: Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-18T13:53:41.80602439Z stdout F 2024-03-18T13:53:41.805Z [alertmanager] ts=2024-03-18T13:53:41.805Z caller=notify.go:745 level=warn component=dispatcher receiver=placeholder integration=webhook[0] aggrGroup="{}:{juju_application=\"microk8s\", juju_model=\"cos-microk8s\", juju_model_uuid=\"b96b05ee-afa6-46fd-8ec7-02ca7528a5d9\"}" msg="Notify attempt failed, will retry later" attempts=1 err="Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-18T13:53:41.805644461Z stdout F 2024-03-18T13:53:41.805Z [alertmanager] ts=2024-03-18T13:53:41.805Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="placeholder/webhook[0]: notify retry canceled after 17 attempts: Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-18T13:53:41.805995605Z stdout F 2024-03-18T13:53:41.805Z [alertmanager] ts=2024-03-18T13:53:41.805Z caller=notify.go:745 level=warn component=dispatcher receiver=placeholder integration=webhook[0] aggrGroup="{}:{juju_application=\"alertmanager\", juju_model=\"cos\", juju_model_uuid=\"4ccf0ff7-981f-45eb-86d9-4c6f0b922527\"}" msg="Notify attempt failed, will retry later" attempts=1 err="Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"

nobuto-m commented 3 months ago

oh...

root@alertmanager-0:/# cat /etc/alertmanager/alertmanager.yml 
global:
  http_config:
    tls_config:
      insecure_skip_verify: false
receivers:
- name: placeholder
  webhook_configs:
  - url: http://127.0.0.1:5001/
route:
  group_by:
  - juju_application
  - juju_model_uuid
  - juju_model
  group_interval: 5m
  group_wait: 30s
  receiver: placeholder
  repeat_interval: 1h

sed-i commented 3 months ago

Hi @nobuto-m, yes, this is coming from the placeholder receiver. Alertmanager won't start without this config. You would need to provide your own "real" config via a charm config option.

nobuto-m commented 3 months ago

How exactly? I didn't see a relevant topic in the documentation or the configuration page. https://charmhub.io/topics/canonical-observability-stack https://charmhub.io/alertmanager-k8s/configuration

simskij commented 3 months ago

How exactly? I didn't see a relevant topic in the documentation or the configuration page. https://charmhub.io/topics/canonical-observability-stack https://charmhub.io/alertmanager-k8s/configuration

It's linked in the description of the config_file property on the second page you linked. https://www.prometheus.io/docs/alerting/latest/configuration/
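
For instance, a minimal "real" config and the charm option could look like the sketch below; the Slack receiver and file name are just placeholders, and the @file value syntax assumes a reasonably recent Juju:

# Write a minimal Alertmanager config with a real receiver (all values are placeholders).
cat > alertmanager.yml <<'EOF'
route:
  receiver: slack
receivers:
- name: slack
  slack_configs:
  - api_url: https://hooks.slack.com/services/REPLACE/ME
    channel: '#alerts'
EOF

# Hand the file contents to the charm via its config_file option.
juju config alertmanager config_file=@alertmanager.yml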

nobuto-m commented 3 months ago

I mean, do operators have to write the whole alertmanager.yml config just to specify where to send alerts? And do they have to know the following trick without documentation?

  group_by:
  - juju_application
  - juju_model_uuid
  - juju_model

simskij commented 3 months ago

I mean, do operators have to write the whole alertmanager.yml config just to specify where to send alerts? And do they have to know the following trick without documentation?

  group_by:
  - juju_application
  - juju_model_uuid
  - juju_model

Yes, that's how it works. As for the group_by, it is injected automatically without the user needing to supply it.
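
If you want to verify that, the rendered file inside the workload container should show the juju_* group_by keys added on top of whatever route you supplied, much like the placeholder config shown earlier; something along these lines should display it (the container name "alertmanager" is an assumption):

juju ssh --container alertmanager alertmanager/0 cat /etc/alertmanager/alertmanager.yml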

We are looking to provide some common config examples in the docs in the future, but at the moment that's how it is.