Closed — nobuto-m closed this issue 3 months ago.
Looks like those failures occur every 5 min, which matches the interval of the update-status hook.
Hmm, scratch that. `dial tcp 127.0.0.1:5001: connect: connection refused`
is still happening every 5 min even after setting update-status-hook-interval=30m.
2024-03-18T13:38:41.800801283Z stdout F 2024-03-18T13:38:41.800Z [alertmanager] ts=2024-03-18T13:38:41.800Z caller=notify.go:745 level=warn component=dispatcher receiver=placeholder integration=webhook[0] aggrGroup="{}:{juju_application=\"alertmanager\", juju_model=\"cos\", juju_model_uuid=\"4ccf0ff7-981f-45eb-86d9-4c6f0b922527\"}" msg="Notify attempt failed, will retry later" attempts=1 err="Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-18T13:38:42.215479478Z stdout F 2024-03-18T13:38:42.215Z [container-agent] 2024-03-18 13:38:42 INFO juju.worker.uniter.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
2024-03-18T13:43:41.8041126Z stdout F 2024-03-18T13:43:41.803Z [alertmanager] ts=2024-03-18T13:43:41.803Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="placeholder/webhook[0]: notify retry canceled after 16 attempts: Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-18T13:43:41.804073306Z stdout F 2024-03-18T13:43:41.803Z [alertmanager] ts=2024-03-18T13:43:41.803Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=2 err="placeholder/webhook[0]: notify retry canceled after 16 attempts: Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-18T13:43:41.804506534Z stdout F 2024-03-18T13:43:41.804Z [alertmanager] ts=2024-03-18T13:43:41.804Z caller=notify.go:745 level=warn component=dispatcher receiver=placeholder integration=webhook[0] aggrGroup="{}:{juju_application=\"alertmanager\", juju_model=\"cos\", juju_model_uuid=\"4ccf0ff7-981f-45eb-86d9-4c6f0b922527\"}" msg="Notify attempt failed, will retry later" attempts=1 err="Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-18T13:43:41.804520861Z stdout F 2024-03-18T13:43:41.804Z [alertmanager] ts=2024-03-18T13:43:41.804Z caller=notify.go:745 level=warn component=dispatcher receiver=placeholder integration=webhook[0] aggrGroup="{}:{juju_application=\"microk8s\", juju_model=\"cos-microk8s\", juju_model_uuid=\"b96b05ee-afa6-46fd-8ec7-02ca7528a5d9\"}" msg="Notify attempt failed, will retry later" attempts=1 err="Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-18T13:48:41.8047974Z stdout F 2024-03-18T13:48:41.804Z [alertmanager] ts=2024-03-18T13:48:41.804Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=2 err="placeholder/webhook[0]: notify retry canceled after 17 attempts: Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-18T13:48:41.804756764Z stdout F 2024-03-18T13:48:41.804Z [alertmanager] ts=2024-03-18T13:48:41.804Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="placeholder/webhook[0]: notify retry canceled after 16 attempts: Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-18T13:48:41.805093239Z stdout F 2024-03-18T13:48:41.805Z [alertmanager] ts=2024-03-18T13:48:41.804Z caller=notify.go:745 level=warn component=dispatcher receiver=placeholder integration=webhook[0] aggrGroup="{}:{juju_application=\"microk8s\", juju_model=\"cos-microk8s\", juju_model_uuid=\"b96b05ee-afa6-46fd-8ec7-02ca7528a5d9\"}" msg="Notify attempt failed, will retry later" attempts=1 err="Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-18T13:48:41.805115331Z stdout F 2024-03-18T13:48:41.805Z [alertmanager] ts=2024-03-18T13:48:41.804Z caller=notify.go:745 level=warn component=dispatcher receiver=placeholder integration=webhook[0] aggrGroup="{}:{juju_application=\"alertmanager\", juju_model=\"cos\", juju_model_uuid=\"4ccf0ff7-981f-45eb-86d9-4c6f0b922527\"}" msg="Notify attempt failed, will retry later" attempts=1 err="Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-18T13:53:41.8056064Z stdout F 2024-03-18T13:53:41.805Z [alertmanager] ts=2024-03-18T13:53:41.805Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=2 err="placeholder/webhook[0]: notify retry canceled after 16 attempts: Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-18T13:53:41.80602439Z stdout F 2024-03-18T13:53:41.805Z [alertmanager] ts=2024-03-18T13:53:41.805Z caller=notify.go:745 level=warn component=dispatcher receiver=placeholder integration=webhook[0] aggrGroup="{}:{juju_application=\"microk8s\", juju_model=\"cos-microk8s\", juju_model_uuid=\"b96b05ee-afa6-46fd-8ec7-02ca7528a5d9\"}" msg="Notify attempt failed, will retry later" attempts=1 err="Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-18T13:53:41.805644461Z stdout F 2024-03-18T13:53:41.805Z [alertmanager] ts=2024-03-18T13:53:41.805Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="placeholder/webhook[0]: notify retry canceled after 17 attempts: Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
2024-03-18T13:53:41.805995605Z stdout F 2024-03-18T13:53:41.805Z [alertmanager] ts=2024-03-18T13:53:41.805Z caller=notify.go:745 level=warn component=dispatcher receiver=placeholder integration=webhook[0] aggrGroup="{}:{juju_application=\"alertmanager\", juju_model=\"cos\", juju_model_uuid=\"4ccf0ff7-981f-45eb-86d9-4c6f0b922527\"}" msg="Notify attempt failed, will retry later" attempts=1 err="Post \"<redacted>\": dial tcp 127.0.0.1:5001: connect: connection refused"
oh...
root@alertmanager-0:/# cat /etc/alertmanager/alertmanager.yml
global:
  http_config:
    tls_config:
      insecure_skip_verify: false
receivers:
- name: placeholder
  webhook_configs:
  - url: http://127.0.0.1:5001/
route:
  group_by:
  - juju_application
  - juju_model_uuid
  - juju_model
  group_interval: 5m
  group_wait: 30s
  receiver: placeholder
  repeat_interval: 1h
Hi @nobuto-m. Yes, this is coming from the placeholder receiver; Alertmanager won't start without some config. You would need to provide your own "real" config via a charm config option.
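For example (a minimal sketch, not official guidance: the `config_file` option name comes from the charm's configuration page linked below, and the exact invocation may differ by Juju version):

```shell
# Hypothetical sketch: pass a complete Alertmanager configuration file
# to the alertmanager-k8s charm through its config_file option.
juju config alertmanager-k8s config_file="$(cat ./alertmanager.yml)"
```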
How exactly? I didn't see a relevant topic in the documentation and config. https://charmhub.io/topics/canonical-observability-stack https://charmhub.io/alertmanager-k8s/configuration
It's linked in the description of the config_file property on the second page you linked. https://www.prometheus.io/docs/alerting/latest/configuration/
I mean, do operators have to write the whole alertmanager.yml config just to specify where to send alerts? Do they have to know the following trick without documentation?
group_by:
- juju_application
- juju_model_uuid
- juju_model
Yes, that's how it works. As for the group_by, those labels are injected automatically without the user needing to supply them.
We are looking to provide some common config examples in the docs in the future, but at the moment that's how it is.
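As an illustration (a hedged sketch only: the receiver name and webhook URL below are placeholders to replace with your own, and this assumes the config is supplied via the charm's config_file option):

```yaml
# Minimal example config for the config_file option.
# The juju_* group_by labels shown in the placeholder config are
# injected by the charm automatically, so they are omitted here.
route:
  receiver: my-webhook
receivers:
- name: my-webhook
  webhook_configs:
  - url: http://my-receiver.example.com:5001/
```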
Bug Description
The AlertmanagerNotificationsFailed alert fires out of the box.
To Reproduce
Environment
Relevant log output
pod_alertmanager-0.log
Additional context
No response