canonical / alertmanager-k8s-operator

https://charmhub.io/alertmanager-k8s
Apache License 2.0

error hook failed: "start" in the middle of deployment #282

Open nobuto-m opened 3 months ago

nobuto-m commented 3 months ago

Bug Description

The charm unnecessarily goes into an error state in the middle of deployment.

13 Aug 2024 15:24:36Z juju-unit error hook failed: "start"

$ juju show-status-log alertmanager/0 --days 1
Time                   Type       Status       Message
13 Aug 2024 15:23:38Z  workload   waiting      installing agent
13 Aug 2024 15:23:38Z  juju-unit  allocating   
13 Aug 2024 15:24:07Z  workload   waiting      agent initialising
13 Aug 2024 15:24:11Z  workload   maintenance  installing charm software
13 Aug 2024 15:24:11Z  juju-unit  executing    running install hook
13 Aug 2024 15:24:14Z  juju-unit  executing    running alerting-relation-created hook
13 Aug 2024 15:24:16Z  juju-unit  executing    running catalogue-relation-created hook
13 Aug 2024 15:24:17Z  juju-unit  executing    running grafana-dashboard-relation-created hook
13 Aug 2024 15:24:19Z  juju-unit  executing    running ingress-relation-created hook
13 Aug 2024 15:24:20Z  juju-unit  executing    running replicas-relation-created hook
13 Aug 2024 15:24:21Z  juju-unit  executing    running self-metrics-endpoint-relation-created hook
13 Aug 2024 15:24:22Z  juju-unit  executing    running grafana-source-relation-created hook
13 Aug 2024 15:24:23Z  juju-unit  executing    running leader-elected hook
13 Aug 2024 15:24:25Z  juju-unit  executing    running alertmanager-pebble-ready hook
13 Aug 2024 15:24:26Z  juju-unit  executing    running data-storage-attached hook
13 Aug 2024 15:24:27Z  juju-unit  executing    running config-changed hook
13 Aug 2024 15:24:36Z  juju-unit  error        hook failed: "start"
13 Aug 2024 15:24:41Z  juju-unit  executing    running start hook
13 Aug 2024 15:24:44Z  workload   unknown      
13 Aug 2024 15:24:44Z  juju-unit  executing    running alertmanager-pebble-ready hook
13 Aug 2024 15:24:47Z  juju-unit  executing    running catalogue-relation-joined hook for catalogue/0
13 Aug 2024 15:24:49Z  juju-unit  executing    running alerting-relation-joined hook for loki/0

To Reproduce

  1. juju deploy cos-lite --trust --channel latest/edge

Environment

$ juju status
Model  Controller       Cloud/Region            Version  SLA          Timestamp
cos    maas-controller  cos-microk8s/localhost  3.5.3    unsupported  01:04:15Z

App           Version  Status  Scale  Charm             Channel      Rev  Address         Exposed  Message
alertmanager  0.27.0   active      1  alertmanager-k8s  latest/edge  129  10.152.183.171  no       
catalogue              active      1  catalogue-k8s     latest/edge   59  10.152.183.188  no       
grafana       9.5.3    active      1  grafana-k8s       latest/edge  118  10.152.183.65   no       
loki          2.9.6    active      1  loki-k8s          latest/edge  163  10.152.183.227  no       
prometheus    2.52.0   active      1  prometheus-k8s    latest/edge  210  10.152.183.87   no       
traefik       2.11.0   active      1  traefik-k8s       latest/edge  203  10.152.183.222  no       Serving at 192.168.151.81

Relevant log output

unit-alertmanager-0: 15:24:29 DEBUG unit.alertmanager/0.juju-log receive_response_body.complete
unit-alertmanager-0: 15:24:29 DEBUG unit.alertmanager/0.juju-log response_closed.started
unit-alertmanager-0: 15:24:29 DEBUG unit.alertmanager/0.juju-log response_closed.complete
unit-alertmanager-0: 15:24:29 INFO juju.worker.caasunitterminationworker terminating due to SIGTERM
unit-alertmanager-0: 15:24:29 INFO juju.worker.uniter.operation ran "config-changed" hook (via hook dispatching script: dispatch)
unit-alertmanager-0: 15:24:29 INFO juju.worker.uniter unit "alertmanager/0" shutting down: agent should be terminated
unit-alertmanager-0: 15:24:32 INFO juju.cmd running containerAgent [3.5.3 63d460f9ee6c7c710131961390687e7a0ab90470 gc go1.21.12]
unit-alertmanager-0: 15:24:32 INFO juju.cmd.containeragent.unit start "unit"
unit-alertmanager-0: 15:24:32 INFO juju.worker.upgradesteps upgrade steps for 3.5.3 have already been run.
unit-alertmanager-0: 15:24:32 INFO juju.worker.probehttpserver starting http server on 127.0.0.1:65301
unit-alertmanager-0: 15:24:32 INFO juju.api connection established to "wss://192.168.151.101:17070/model/e1604679-6045-410d-8aab-53bf0f2ea605/api"
unit-alertmanager-0: 15:24:32 INFO juju.worker.apicaller [e16046] "unit-alertmanager-0" successfully connected to "192.168.151.101:17070"
unit-alertmanager-0: 15:24:32 INFO juju.api connection established to "wss://192.168.151.101:17070/model/e1604679-6045-410d-8aab-53bf0f2ea605/api"
unit-alertmanager-0: 15:24:32 INFO juju.worker.apicaller [e16046] "unit-alertmanager-0" successfully connected to "192.168.151.101:17070"
unit-alertmanager-0: 15:24:32 INFO juju.worker.migrationminion migration migration phase is now: NONE
unit-alertmanager-0: 15:24:32 INFO juju.worker.logger logger worker started
unit-alertmanager-0: 15:24:32 WARNING juju.worker.proxyupdater unable to set snap core settings [proxy.http= proxy.https= proxy.store=]: exec: "snap": executable file not found in $PATH, output: ""
unit-alertmanager-0: 15:24:32 INFO juju.worker.leadership alertmanager/0 promoted to leadership of alertmanager
unit-alertmanager-0: 15:24:32 INFO juju.worker.caasupgrader abort check blocked until version event received
unit-alertmanager-0: 15:24:32 INFO juju.worker.caasupgrader unblocking abort check
unit-alertmanager-0: 15:24:32 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-alertmanager-0
unit-alertmanager-0: 15:24:32 INFO juju.worker.uniter unit "alertmanager/0" started
unit-alertmanager-0: 15:24:32 INFO juju.worker.uniter hooks are retried true
unit-alertmanager-0: 15:24:33 INFO juju.worker.uniter.charm downloading ch:amd64/focal/alertmanager-k8s-129 from API server
unit-alertmanager-0: 15:24:33 INFO juju.downloader downloading from ch:amd64/focal/alertmanager-k8s-129
unit-alertmanager-0: 15:24:33 INFO juju.downloader download complete ("ch:amd64/focal/alertmanager-k8s-129")
unit-alertmanager-0: 15:24:33 INFO juju.downloader download verified ("ch:amd64/focal/alertmanager-k8s-129")
unit-alertmanager-0: 15:24:36 INFO juju.worker.uniter awaiting error resolution for "start" hook
unit-alertmanager-0: 15:24:37 INFO juju.worker.uniter awaiting error resolution for "start" hook
unit-alertmanager-0: 15:24:41 INFO juju.worker.uniter awaiting error resolution for "start" hook
unit-alertmanager-0: 15:24:43 DEBUG unit.alertmanager/0.juju-log ops 2.15.0 up and running.

Additional context

cos-lite_debug.log

lucabello commented 2 months ago

We should probably implement a guard to make sure the Kubernetes resource patch is ready before we do anything else, as was done in this PR. Something along the lines of the sketch below.
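A minimal sketch of such a guard in an ops-based charm, assuming the charm wires its events through a common handler and uses the KubernetesComputeResourcesPatch library; the `resources_patch` attribute, its `is_ready()` check, and the `_common_exit_hook` name are illustrative assumptions, not necessarily the charm's actual code:

```python
# Rough sketch only: names below are assumptions modelled on the
# KubernetesComputeResourcesPatch pattern, not the charm's exact implementation.
from ops.charm import CharmBase
from ops.model import WaitingStatus


class AlertmanagerCharm(CharmBase):
    def _common_exit_hook(self, event) -> None:
        # Guard: if the K8s resource patch has not been applied yet, set a
        # waiting status and defer instead of proceeding (and potentially
        # erroring) in the middle of deployment.
        if not self.resources_patch.is_ready():
            self.unit.status = WaitingStatus("waiting for K8s resource patch to apply")
            event.defer()
            return
        # ... the rest of the usual reconcile logic runs only once the patch is ready ...
```

With a guard like this, the start hook (and any other hook fired while the pod is being re-created by the patch) would defer cleanly rather than fail.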

Abuelodelanada commented 2 months ago

Hey @nobuto-m

I've deployed the same revisions you had:

Model  Controller  Cloud/Region        Version  SLA          Timestamp
cos    microk8s    microk8s/localhost  3.5.4    unsupported  17:44:01-03:00

App           Version  Status  Scale  Charm             Channel      Rev  Address         Exposed  Message
alertmanager  0.27.0   active      1  alertmanager-k8s  latest/edge  129  10.152.183.108  no       
catalogue              active      1  catalogue-k8s     latest/edge   58  10.152.183.135  no       
grafana       9.5.3    active      1  grafana-k8s       latest/edge  118  10.152.183.21   no       
loki          2.9.6    active      1  loki-k8s          latest/edge  163  10.152.183.84   no       
prometheus    2.52.0   active      1  prometheus-k8s    latest/edge  210  10.152.183.177  no       
traefik       2.11.0   active      1  traefik-k8s       latest/edge  203  10.152.183.222  no       Serving at 192.168.1.240

Unit             Workload  Agent  Address     Ports  Message
alertmanager/0*  active    idle   10.1.9.222         
catalogue/0*     active    idle   10.1.9.210         
grafana/0*       active    idle   10.1.9.230         
loki/0*          active    idle   10.1.9.235         
prometheus/0*    active    idle   10.1.9.220         
traefik/0*       active    idle   10.1.9.226         Serving at 192.168.1.240

And I wasn't able to reproduce the issue:

juju show-status-log alertmanager/0  --days 1
Time                        Type       Status       Message
18 Sep 2024 17:15:59-03:00  workload   waiting      installing agent
18 Sep 2024 17:15:59-03:00  juju-unit  allocating   
18 Sep 2024 17:16:15-03:00  workload   waiting      agent initialising
18 Sep 2024 17:16:42-03:00  workload   maintenance  installing charm software
18 Sep 2024 17:16:42-03:00  juju-unit  executing    running install hook
18 Sep 2024 17:16:45-03:00  juju-unit  executing    running alerting-relation-created hook
18 Sep 2024 17:16:47-03:00  juju-unit  executing    running catalogue-relation-created hook
18 Sep 2024 17:16:48-03:00  juju-unit  executing    running grafana-dashboard-relation-created hook
18 Sep 2024 17:16:50-03:00  juju-unit  executing    running grafana-source-relation-created hook
18 Sep 2024 17:16:51-03:00  juju-unit  executing    running ingress-relation-created hook
18 Sep 2024 17:16:52-03:00  juju-unit  executing    running replicas-relation-created hook
18 Sep 2024 17:16:53-03:00  juju-unit  executing    running self-metrics-endpoint-relation-created hook
18 Sep 2024 17:16:54-03:00  juju-unit  executing    running leader-elected hook
18 Sep 2024 17:16:55-03:00  juju-unit  executing    running alertmanager-pebble-ready hook
18 Sep 2024 17:16:56-03:00  juju-unit  executing    running data-storage-attached hook
18 Sep 2024 17:17:42-03:00  juju-unit  executing    running config-changed hook
18 Sep 2024 17:17:53-03:00  juju-unit  executing    running start hook
18 Sep 2024 17:17:55-03:00  juju-unit  executing    running alertmanager-pebble-ready hook
18 Sep 2024 17:17:58-03:00  juju-unit  executing    running ingress-relation-joined hook for traefik/0
18 Sep 2024 17:17:59-03:00  juju-unit  executing    running alerting-relation-joined hook for prometheus/0
18 Sep 2024 17:18:01-03:00  juju-unit  executing    running ingress-relation-changed hook for traefik/0
18 Sep 2024 17:18:02-03:00  juju-unit  executing    running catalogue-relation-joined hook for catalogue/0
18 Sep 2024 17:18:04-03:00  juju-unit  executing    running grafana-dashboard-relation-joined hook for grafana/0
18 Sep 2024 17:18:04-03:00  juju-unit  idle         
18 Sep 2024 17:18:05-03:00  juju-unit  executing    running catalogue-relation-changed hook for catalogue/0
18 Sep 2024 17:18:06-03:00  juju-unit  executing    running grafana-dashboard-relation-changed hook for grafana/0
18 Sep 2024 17:18:07-03:00  juju-unit  idle         
18 Sep 2024 17:18:07-03:00  juju-unit  executing    running self-metrics-endpoint-relation-joined hook for prometheus/0
18 Sep 2024 17:18:08-03:00  juju-unit  executing    running grafana-source-relation-joined hook for grafana/0
18 Sep 2024 17:18:09-03:00  juju-unit  executing    running grafana-source-relation-changed hook for grafana/0
18 Sep 2024 17:18:10-03:00  juju-unit  executing    running alerting-relation-changed hook for prometheus/0
18 Sep 2024 17:18:11-03:00  juju-unit  executing    running alerting-relation-joined hook for loki/0
18 Sep 2024 17:18:12-03:00  juju-unit  executing    running self-metrics-endpoint-relation-changed hook for prometheus/0
18 Sep 2024 17:18:13-03:00  juju-unit  executing    running ingress-relation-changed hook
18 Sep 2024 17:18:16-03:00  workload   active       
18 Sep 2024 17:18:17-03:00  juju-unit  executing    running alerting-relation-changed hook for loki/0
18 Sep 2024 17:18:18-03:00  juju-unit  idle         

I believe that it's a timing issue... How much RAM and CPU does the machine have? 🤔

Could you please try deploying from edge again now and tell me if you see the same issue?