canonical / alertmanager-k8s-operator

This charmed operator automates operation procedures of Alertmanager, the alerting component of Prometheus and Loki, among others.
https://charmhub.io/alertmanager-k8s
Apache License 2.0

pebble restarting alertmanager every ~5 minutes #233

Open sajoupa opened 4 months ago

sajoupa commented 4 months ago

Bug Description

After deploying cos-alerter, I investigated why alertmanager was sending Watchdog alerts so erratically. The alertmanager logs show that pebble restarts the service every few minutes (see the logs below).

To Reproduce

I did not initially deploy this COS cluster, nor do I have a k8s cluster at hand to reproduce. If it helps, here is the alertmanager config:

$ juju g -a alertmanager
juju config alertmanager config_file='receivers:
- name: cos-alerter
  webhook_configs:
  - url: http://<COS_ALERTER_IP>:8080/alive?clientid=<CLUSTER_ID>&key=<REDACTED>

route:
  routes:
    - matchers:
      - alertname="Watchdog"
      - juju_charm="alertmanager-k8s"
  receiver: cos-alerter
'
juju config alertmanager cpu='' #Default#
juju config alertmanager juju-application-path='/' #Default#
juju config alertmanager juju-external-hostname='' #Default#
juju config alertmanager kubernetes-ingress-allow-http='False' #Default#
juju config alertmanager kubernetes-ingress-class='nginx' #Default#
juju config alertmanager kubernetes-ingress-ssl-passthrough='False' #Default#
juju config alertmanager kubernetes-ingress-ssl-redirect='False' #Default#
juju config alertmanager kubernetes-service-annotations='' #Default#
juju config alertmanager kubernetes-service-external-ips='' #Default#
juju config alertmanager kubernetes-service-externalname='' #Default#
juju config alertmanager kubernetes-service-loadbalancer-ip='' #Default#
juju config alertmanager kubernetes-service-loadbalancer-sourceranges='' #Default#
juju config alertmanager kubernetes-service-target-port='' #Default#
juju config alertmanager kubernetes-service-type='' #Default#
juju config alertmanager memory='' #Default#
juju config alertmanager templates_file='' #Default#
juju config alertmanager trust='True'
juju config alertmanager web_external_url='' #Default#

Environment

The COS bundle runs on a microk8s cluster. The charms are the latest/stable versions.

Relevant log output

2024-03-08T09:55:19.483Z [pebble] GET /v1/files?action=list&path=%2Fetc%2Falertmanager%2Falertmanager.cert.pem&itself=true 70.653µs 404
2024-03-08T09:55:19.503Z [pebble] GET /v1/files?action=list&path=%2Fetc%2Falertmanager%2Falertmanager.cert.pem&itself=true 146.295µs 404
2024-03-08T09:55:19.514Z [pebble] GET /v1/files?action=list&path=%2Fetc%2Falertmanager%2Falertmanager.cert.pem&itself=true 72.145µs 404
2024-03-08T09:55:20.387Z [pebble] GET /v1/files?action=list&path=%2Fetc%2Falertmanager%2Falertmanager.cert.pem&itself=true 101.049µs 404
2024-03-08T09:55:20.443Z [pebble] GET /v1/files?action=list&path=%2Fetc%2Falertmanager%2Falertmanager.cert.pem&itself=true 143.089µs 404
2024-03-08T09:55:20.545Z [pebble] GET /v1/files?action=list&path=%2Fetc%2Falertmanager%2Falertmanager.cert.pem&itself=true 131.639µs 404
2024-03-08T09:55:20.615Z [pebble] GET /v1/files?action=list&path=%2Fetc%2Falertmanager%2Falertmanager.cert.pem&itself=true 75.052µs 404
2024-03-08T09:55:20.678Z [pebble] POST /v1/files 9.761966ms 200
2024-03-08T09:55:20.686Z [pebble] POST /v1/files 6.63311ms 200
2024-03-08T09:55:20.686Z [pebble] POST /v1/files 64.852µs 200
2024-03-08T09:55:20.692Z [pebble] POST /v1/files 5.479489ms 200
2024-03-08T09:55:20.692Z [pebble] POST /v1/files 55.415µs 200
2024-03-08T09:55:20.698Z [pebble] POST /v1/files 5.435406ms 200
2024-03-08T09:55:20.698Z [pebble] POST /v1/files 46.487µs 200
2024-03-08T09:55:20.745Z [pebble] POST /v1/exec 45.532637ms 202
2024-03-08T09:55:20.788Z [pebble] GET /v1/tasks/9298/websocket/control 42.379446ms 200
2024-03-08T09:55:20.789Z [pebble] GET /v1/tasks/9298/websocket/stdio 71.586µs 200
2024-03-08T09:55:20.789Z [pebble] GET /v1/tasks/9298/websocket/stderr 55.484µs 200
2024-03-08T09:55:20.895Z [pebble] GET /v1/changes/6974/wait?timeout=4.000s 104.877935ms 200
2024-03-08T09:55:20.897Z [pebble] GET /v1/files?action=list&path=%2Fetc%2Falertmanager%2Falertmanager.cert.pem&itself=true 38.281µs 404
2024-03-08T09:55:20.898Z [pebble] POST /v1/layers 267.033µs 200
2024-03-08T09:55:20.943Z [pebble] POST /v1/services 44.59849ms 202
2024-03-08T09:55:21.073Z [pebble] GET /v1/changes/6975/wait?timeout=4.000s 128.660168ms 200
2024-03-08T09:55:26.941Z [pebble] GET /v1/plan?format=yaml 157.276µs 200
2024-03-08T09:55:26.987Z [pebble] POST /v1/services 44.899247ms 202
2024-03-08T09:55:27.032Z [alertmanager] ts=2024-03-08T09:55:27.031Z caller=main.go:594 level=info msg="Received SIGTERM, exiting gracefully..."
2024-03-08T09:55:27.039Z [pebble] Service "alertmanager" stopped
2024-03-08T09:55:27.128Z [pebble] Service "alertmanager" starting: alertmanager --config.file=/etc/alertmanager/alertmanager.yml --storage.path=/alertmanager --web.listen-address=:9093 --cluster.listen-address= --web.external-url=http://alertmanager-0.alertmanager-endpoints.<cluster-name>.svc.cluster.local:9093
2024-03-08T09:55:27.144Z [alertmanager] ts=2024-03-08T09:55:27.144Z caller=main.go:245 level=info msg="Starting Alertmanager" version="(version=0.26.0, branch=HEAD, revision=d7b4f0c7)"
2024-03-08T09:55:27.144Z [alertmanager] ts=2024-03-08T09:55:27.144Z caller=main.go:246 level=info build_context="(go=go1.18.10, platform=linux/amd64, user=root@rockcraft-alertmanager-on-amd64-for-amd64-797928, date=2024-02-13T14:16:29Z, tags=unknown)"
2024-03-08T09:55:27.173Z [alertmanager] ts=2024-03-08T09:55:27.173Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/alertmanager.yml
2024-03-08T09:55:27.174Z [alertmanager] ts=2024-03-08T09:55:27.174Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/alertmanager.yml
2024-03-08T09:55:27.177Z [alertmanager] ts=2024-03-08T09:55:27.177Z caller=tls_config.go:274 level=info msg="Listening on" address=[::]:9093
2024-03-08T09:55:27.177Z [alertmanager] ts=2024-03-08T09:55:27.177Z caller=tls_config.go:277 level=info msg="TLS is disabled." http2=false address=[::]:9093
2024-03-08T09:55:28.170Z [pebble] GET /v1/changes/6976/wait?timeout=4.000s 1.182784446s 200
2024-03-08T09:59:23.949Z [pebble] GET /v1/files?action=list&path=%2Fetc%2Falertmanager%2Falertmanager.cert.pem&itself=true 45.876µs 404
2024-03-08T09:59:23.963Z [pebble] GET /v1/files?action=list&path=%2Fetc%2Falertmanager%2Falertmanager.cert.pem&itself=true 39.204µs 404
2024-03-08T09:59:23.970Z [pebble] GET /v1/files?action=list&path=%2Fetc%2Falertmanager%2Falertmanager.cert.pem&itself=true 34.515µs 404
2024-03-08T09:59:29.684Z [pebble] GET /v1/files?action=list&path=%2Fetc%2Falertmanager%2Falertmanager.cert.pem&itself=true 42.861µs 404
2024-03-08T09:59:29.718Z [pebble] GET /v1/files?action=list&path=%2Fetc%2Falertmanager%2Falertmanager.cert.pem&itself=true 48.901µs 404
2024-03-08T09:59:29.781Z [pebble] GET /v1/files?action=list&path=%2Fetc%2Falertmanager%2Falertmanager.cert.pem&itself=true 50.245µs 404
2024-03-08T09:59:29.838Z [pebble] GET /v1/files?action=list&path=%2Fetc%2Falertmanager%2Falertmanager.cert.pem&itself=true 44.284µs 404
2024-03-08T09:59:29.884Z [pebble] POST /v1/files 6.790889ms 200
2024-03-08T09:59:29.891Z [pebble] POST /v1/files 6.069903ms 200
2024-03-08T09:59:29.891Z [pebble] POST /v1/files 37.4µs 200
2024-03-08T09:59:29.897Z [pebble] POST /v1/files 5.120127ms 200
2024-03-08T09:59:29.897Z [pebble] POST /v1/files 45.776µs 200
2024-03-08T09:59:29.903Z [pebble] POST /v1/files 5.680761ms 200
2024-03-08T09:59:29.904Z [pebble] POST /v1/files 51.216µs 200
2024-03-08T09:59:29.998Z [pebble] POST /v1/exec 93.772776ms 202
2024-03-08T09:59:29.999Z [pebble] GET /v1/tasks/9302/websocket/control 139.462µs 200
2024-03-08T09:59:30.000Z [pebble] GET /v1/tasks/9302/websocket/stdio 65.343µs 200
2024-03-08T09:59:30.000Z [pebble] GET /v1/tasks/9302/websocket/stderr 50.065µs 200
2024-03-08T09:59:30.097Z [pebble] GET /v1/changes/6977/wait?timeout=4.000s 95.534533ms 200
2024-03-08T09:59:30.099Z [pebble] GET /v1/files?action=list&path=%2Fetc%2Falertmanager%2Falertmanager.cert.pem&itself=true 39.754µs 404
2024-03-08T09:59:30.100Z [pebble] POST /v1/layers 223.682µs 200
2024-03-08T09:59:30.189Z [pebble] POST /v1/services 87.501436ms 202
2024-03-08T09:59:30.280Z [pebble] GET /v1/changes/6978/wait?timeout=4.000s 90.123621ms 200
2024-03-08T09:59:31.147Z [pebble] GET /v1/plan?format=yaml 233.52µs 200
2024-03-08T09:59:31.239Z [pebble] POST /v1/services 91.600339ms 202
2024-03-08T09:59:31.240Z [alertmanager] ts=2024-03-08T09:59:31.240Z caller=main.go:594 level=info msg="Received SIGTERM, exiting gracefully..."
2024-03-08T09:59:31.248Z [pebble] Service "alertmanager" stopped
2024-03-08T09:59:31.352Z [pebble] Service "alertmanager" starting: alertmanager --config.file=/etc/alertmanager/alertmanager.yml --storage.path=/alertmanager --web.listen-address=:9093 --cluster.listen-address= --web.external-url=http://alertmanager-0.alertmanager-endpoints.<cluster-name>.svc.cluster.local:9093
2024-03-08T09:59:31.366Z [alertmanager] ts=2024-03-08T09:59:31.366Z caller=main.go:245 level=info msg="Starting Alertmanager" version="(version=0.26.0, branch=HEAD, revision=d7b4f0c7)"
2024-03-08T09:59:31.366Z [alertmanager] ts=2024-03-08T09:59:31.366Z caller=main.go:246 level=info build_context="(go=go1.18.10, platform=linux/amd64, user=root@rockcraft-alertmanager-on-amd64-for-amd64-797928, date=2024-02-13T14:16:29Z, tags=unknown)"
2024-03-08T09:59:31.391Z [alertmanager] ts=2024-03-08T09:59:31.391Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/alertmanager.yml
2024-03-08T09:59:31.391Z [alertmanager] ts=2024-03-08T09:59:31.391Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/alertmanager.yml
2024-03-08T09:59:31.394Z [alertmanager] ts=2024-03-08T09:59:31.394Z caller=tls_config.go:274 level=info msg="Listening on" address=[::]:9093
2024-03-08T09:59:31.394Z [alertmanager] ts=2024-03-08T09:59:31.394Z caller=tls_config.go:277 level=info msg="TLS is disabled." http2=false address=[::]:9093
2024-03-08T09:59:32.392Z [pebble] GET /v1/changes/6979/wait?timeout=4.000s 1.151803833s 200
2024-03-08T10:04:54.987Z [pebble] GET /v1/files?action=list&path=%2Fetc%2Falertmanager%2Falertmanager.cert.pem&itself=true 53.309µs 404
2024-03-08T10:04:55.002Z [pebble] GET /v1/files?action=list&path=%2Fetc%2Falertmanager%2Falertmanager.cert.pem&itself=true 93.055µs 404
2024-03-08T10:04:55.010Z [pebble] GET /v1/files?action=list&path=%2Fetc%2Falertmanager%2Falertmanager.cert.pem&itself=true 57.58µs 404
2024-03-08T10:05:00.798Z [pebble] GET /v1/files?action=list&path=%2Fetc%2Falertmanager%2Falertmanager.cert.pem&itself=true 55.334µs 404
2024-03-08T10:05:00.841Z [pebble] GET /v1/files?action=list&path=%2Fetc%2Falertmanager%2Falertmanager.cert.pem&itself=true 75.532µs 404
2024-03-08T10:05:00.930Z [pebble] GET /v1/files?action=list&path=%2Fetc%2Falertmanager%2Falertmanager.cert.pem&itself=true 60.214µs 404
2024-03-08T10:05:01.004Z [pebble] GET /v1/files?action=list&path=%2Fetc%2Falertmanager%2Falertmanager.cert.pem&itself=true 45.505µs 404
2024-03-08T10:05:01.056Z [pebble] POST /v1/files 8.33658ms 200
2024-03-08T10:05:01.079Z [pebble] POST /v1/files 23.099067ms 200
2024-03-08T10:05:01.080Z [pebble] POST /v1/files 49.173µs 200
2024-03-08T10:05:01.091Z [pebble] POST /v1/files 10.452614ms 200
2024-03-08T10:05:01.091Z [pebble] POST /v1/files 38.662µs 200
2024-03-08T10:05:01.115Z [pebble] POST /v1/files 23.346963ms 200
2024-03-08T10:05:01.116Z [pebble] POST /v1/files 69.56µs 200
2024-03-08T10:05:01.162Z [pebble] POST /v1/exec 45.465995ms 202
2024-03-08T10:05:01.207Z [pebble] GET /v1/tasks/9306/websocket/control 43.880841ms 200
2024-03-08T10:05:01.207Z [pebble] GET /v1/tasks/9306/websocket/stdio 74.14µs 200
2024-03-08T10:05:01.208Z [pebble] GET /v1/tasks/9306/websocket/stderr 37.782µs 200
2024-03-08T10:05:01.318Z [pebble] GET /v1/changes/6980/wait?timeout=4.000s 108.934344ms 200
2024-03-08T10:05:01.320Z [pebble] GET /v1/files?action=list&path=%2Fetc%2Falertmanager%2Falertmanager.cert.pem&itself=true 41.798µs 404
2024-03-08T10:05:01.321Z [pebble] POST /v1/layers 242.957µs 200
2024-03-08T10:05:01.366Z [pebble] POST /v1/services 44.264083ms 202
2024-03-08T10:05:01.498Z [pebble] GET /v1/changes/6981/wait?timeout=4.000s 131.152483ms 200
2024-03-08T10:05:02.364Z [pebble] GET /v1/plan?format=yaml 183.685µs 200
2024-03-08T10:05:02.409Z [pebble] POST /v1/services 44.809519ms 202
2024-03-08T10:05:02.455Z [alertmanager] ts=2024-03-08T10:05:02.455Z caller=main.go:594 level=info msg="Received SIGTERM, exiting gracefully..."
2024-03-08T10:05:02.460Z [pebble] Service "alertmanager" stopped
2024-03-08T10:05:02.552Z [pebble] Service "alertmanager" starting: alertmanager --config.file=/etc/alertmanager/alertmanager.yml --storage.path=/alertmanager --web.listen-address=:9093 --cluster.listen-address= --web.external-url=http://alertmanager-0.alertmanager-endpoints.<cluster-name>.svc.cluster.local:9093
2024-03-08T10:05:02.571Z [alertmanager] ts=2024-03-08T10:05:02.571Z caller=main.go:245 level=info msg="Starting Alertmanager" version="(version=0.26.0, branch=HEAD, revision=d7b4f0c7)"
2024-03-08T10:05:02.571Z [alertmanager] ts=2024-03-08T10:05:02.571Z caller=main.go:246 level=info build_context="(go=go1.18.10, platform=linux/amd64, user=root@rockcraft-alertmanager-on-amd64-for-amd64-797928, date=2024-02-13T14:16:29Z, tags=unknown)"
2024-03-08T10:05:02.603Z [alertmanager] ts=2024-03-08T10:05:02.603Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/alertmanager.yml
2024-03-08T10:05:02.604Z [alertmanager] ts=2024-03-08T10:05:02.604Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/alertmanager.yml
2024-03-08T10:05:02.610Z [alertmanager] ts=2024-03-08T10:05:02.610Z caller=tls_config.go:274 level=info msg="Listening on" address=[::]:9093
2024-03-08T10:05:02.610Z [alertmanager] ts=2024-03-08T10:05:02.610Z caller=tls_config.go:277 level=info msg="TLS is disabled." http2=false address=[::]:9093
2024-03-08T10:05:03.598Z [pebble] GET /v1/changes/6982/wait?timeout=4.000s 1.188174618s 200
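To quantify the restart cadence from a log like the one above, a minimal sketch in Python follows. It assumes the log has been saved as text lines in the format shown; the helper names `restart_times` and `restart_gaps_seconds` are hypothetical, and the sample lines are shortened copies of the three "starting" entries from this report.

```python
from datetime import datetime


def restart_times(log_lines, service="alertmanager"):
    """Return the timestamps at which pebble logged the service starting."""
    marker = f'[pebble] Service "{service}" starting:'
    times = []
    for line in log_lines:
        if marker in line:
            # The first whitespace-separated field is the pebble timestamp,
            # e.g. 2024-03-08T09:55:27.128Z
            stamp = line.split(" ", 1)[0]
            times.append(datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ"))
    return times


def restart_gaps_seconds(times):
    """Return the gap in seconds between each pair of consecutive starts."""
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


# Shortened copies of the "starting" lines from the log above.
sample = [
    '2024-03-08T09:55:27.128Z [pebble] Service "alertmanager" starting: alertmanager',
    '2024-03-08T09:59:31.352Z [pebble] Service "alertmanager" starting: alertmanager',
    '2024-03-08T10:05:02.552Z [pebble] Service "alertmanager" starting: alertmanager',
]

gaps = restart_gaps_seconds(restart_times(sample))
print([round(gap) for gap in gaps])
```

For the three restarts captured here, the gaps come out at roughly four minutes and five and a half minutes, which matches the "every ~5 minutes" in the title.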

Additional context

No response

simskij commented 1 month ago

I'm of the opinion that we don't want to drop any logs, so for me drop_newest is not a viable option. Overflowing to another buffer (https://vector.dev/docs/about/under-the-hood/architecture/buffering-model/#overflow-to-another-buffer-overflow) seems like exactly what we're looking for, but it is not yet suitable for production, so we'll need to track the status of that mode. Until then, we should add alert rules in vector to surface when the service either

a) goes into a crash loop, or
b) has asymmetric input/output, indicating backpressure.
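The two conditions can be sketched as plain predicates, purely to make the intent concrete. This is an illustration in Python, not an actual alert rule: the function names, the restart limit, and the tolerance are all hypothetical, and a real rule would be written against whatever metrics the deployment exposes.

```python
def is_crash_looping(restarts_in_window, max_restarts=3):
    """Condition (a): more restarts in the observation window than expected.

    max_restarts is a hypothetical threshold, not a recommended value.
    """
    return restarts_in_window > max_restarts


def has_backpressure(events_in, events_out, tolerance=0.1):
    """Condition (b): output lags input by more than the given fraction.

    tolerance is a hypothetical threshold; 0.1 means a 10% shortfall.
    """
    if events_in == 0:
        return False  # nothing received, so nothing can be backed up
    return (events_in - events_out) / events_in > tolerance


print(is_crash_looping(5))          # above the assumed limit of 3
print(has_backpressure(1000, 700))  # 30% shortfall exceeds the 10% tolerance
```

Comparing counts over the same time window matters for condition (b): a short-lived difference between input and output is normal while a buffer drains, so the tolerance and the window length would need tuning together.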