canonical / grafana-k8s-operator

https://charmhub.io/grafana-k8s
Apache License 2.0

Grafana fails with hook failed: "grafana-dashboard-relation-created" #140

Closed: Sponge-Bas closed this issue 2 years ago

Sponge-Bas commented 2 years ago

Bug Description

In test run https://solutions.qa.canonical.com/v2/testruns/f12b5292-2b3e-4d63-a466-f5c2c4366f57/ the COS stack fails to install, with the following juju status:

Model       Controller       Cloud/Region                Version  SLA          Timestamp
controller  foundations-k8s  kubernetes_cloud/us-east-1  2.9.35   unsupported  12:19:06Z

Model "controller" is empty.
Model  Controller       Cloud/Region                Version  SLA          Timestamp
cos    foundations-k8s  kubernetes_cloud/us-east-1  2.9.35   unsupported  12:19:06Z

App           Version  Status   Scale  Charm             Channel  Rev  Address                                                                 Exposed  Message
alertmanager  0.23.0   waiting      1  alertmanager-k8s  edge      33  10.152.183.222                                                          no       waiting for container
catalogue              active       1  catalogue-k8s     edge       3  10.152.183.119                                                          no       
grafana                waiting    0/1  grafana-k8s       edge      45  10.152.183.97                                                           no       installing agent
loki                   waiting      1  loki-k8s          edge      45  10.152.183.153                                                          no       installing agent
prometheus             waiting      1  prometheus-k8s    edge      75  10.152.183.67                                                           no       installing agent
traefik                waiting      1  traefik-k8s       edge      89  ae8a8e7eac33646d4ada3fe220bc3831-239371593.us-east-1.elb.amazonaws.com  no       installing agent

Unit             Workload  Agent  Address         Ports  Message
alertmanager/0*  active    idle   192.168.133.73         
catalogue/0*     active    idle   192.168.218.70         
grafana/0*       error     lost   192.168.133.71         hook failed: "grafana-dashboard-relation-created"
loki/0*          waiting   idle   192.168.218.72         Waiting for resource limit patch to apply
prometheus/0*    waiting   idle   192.168.218.71         Waiting for resource limit patch to apply
traefik/0*       error     idle   192.168.133.72         hook failed: "metrics-endpoint-relation-joined"

To Reproduce

We got to this state by deploying the COS stack on Charmed Kubernetes on jammy.
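
For reference, a deployment along these lines produces the topology shown in the status above. This is a sketch only; the exact bundle revision and any overlays used in the test run are in the linked crashdump artifacts:

juju add-model cos
juju deploy cos-lite --channel=edge --trust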

Environment

Kubernetes is Charmed Kubernetes (ck8s) 1.24 on jammy; COS is deployed from latest/edge.

Relevant log output

Nothing in the unit logs points to a cause for the error; the hook simply fails with signal: terminated:

unit-grafana-0: 12:18:11 INFO juju.cmd running containerAgent [2.9.35 da3416008ea4ce7851a4c967ae191a0044917024 gc go1.19.2]
unit-grafana-0: 12:18:11 WARNING cmd developer feature flags enabled: "actions-v2"
unit-grafana-0: 12:18:11 INFO juju.cmd.containeragent.unit start "unit"
unit-grafana-0: 12:18:11 INFO juju.worker.upgradesteps upgrade steps for 2.9.35 have already been run.
unit-grafana-0: 12:18:11 INFO juju.worker.probehttpserver starting http server on [::]:65301
unit-grafana-0: 12:18:11 INFO juju.api cannot resolve "a703653953140478c9ac07832662e677-1313776419.us-east-1.elb.amazonaws.com": lookup a703653953140478c9ac07832662e677-1313776419.us-east-1.elb.amazonaws.com: operation was canceled
unit-grafana-0: 12:18:11 INFO juju.api connection established to "wss://controller-service.controller-foundations-k8s.svc.cluster.local:17070/model/81fcf692-fbf2-46ef-8d2a-48bf98f3f492/api"
unit-grafana-0: 12:18:11 INFO juju.worker.apicaller [81fcf6] "unit-grafana-0" successfully connected to "controller-service.controller-foundations-k8s.svc.cluster.local:17070"
unit-grafana-0: 12:18:11 INFO juju.api cannot resolve "a703653953140478c9ac07832662e677-1313776419.us-east-1.elb.amazonaws.com": lookup a703653953140478c9ac07832662e677-1313776419.us-east-1.elb.amazonaws.com: operation was canceled
unit-grafana-0: 12:18:11 INFO juju.api connection established to "wss://controller-service.controller-foundations-k8s.svc.cluster.local:17070/model/81fcf692-fbf2-46ef-8d2a-48bf98f3f492/api"
unit-grafana-0: 12:18:11 INFO juju.worker.apicaller [81fcf6] "unit-grafana-0" successfully connected to "controller-service.controller-foundations-k8s.svc.cluster.local:17070"
unit-grafana-0: 12:18:11 INFO juju.worker.migrationminion migration phase is now: NONE
unit-grafana-0: 12:18:11 INFO juju.worker.logger logger worker started
unit-grafana-0: 12:18:11 WARNING juju.worker.proxyupdater unable to set snap core settings [proxy.http= proxy.https= proxy.store=]: exec: "snap": executable file not found in $PATH, output: ""
unit-grafana-0: 12:18:11 INFO juju.worker.caasupgrader abort check blocked until version event received
unit-grafana-0: 12:18:11 INFO juju.worker.caasupgrader unblocking abort check
unit-grafana-0: 12:18:11 INFO juju.worker.leadership grafana/0 promoted to leadership of grafana
unit-grafana-0: 12:18:11 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-grafana-0
unit-grafana-0: 12:18:11 INFO juju.worker.uniter unit "grafana/0" started
unit-grafana-0: 12:18:11 INFO juju.worker.uniter resuming charm install
unit-grafana-0: 12:18:11 INFO juju.worker.uniter.charm downloading ch:amd64/focal/grafana-k8s-45 from API server
unit-grafana-0: 12:18:11 INFO juju.downloader downloading from ch:amd64/focal/grafana-k8s-45
unit-grafana-0: 12:18:12 INFO juju.downloader download complete ("ch:amd64/focal/grafana-k8s-45")
unit-grafana-0: 12:18:12 INFO juju.downloader download verified ("ch:amd64/focal/grafana-k8s-45")
unit-grafana-0: 12:18:27 INFO juju.worker.uniter hooks are retried true
unit-grafana-0: 12:18:27 INFO juju.worker.uniter found queued "install" hook
unit-grafana-0: 12:18:30 INFO unit.grafana/0.juju-log Running legacy hooks/install.
unit-grafana-0: 12:18:32 INFO unit.grafana/0.juju-log Successfully patched the Kubernetes service!
unit-grafana-0: 12:18:32 INFO juju.worker.uniter.operation ran "install" hook (via hook dispatching script: dispatch)
unit-grafana-0: 12:18:35 INFO juju.worker.uniter.operation ran "catalogue-relation-created" hook (via hook dispatching script: dispatch)
unit-grafana-0: 12:18:36 INFO juju.worker.uniter.operation ran "grafana-dashboard-relation-created" hook (via hook dispatching script: dispatch)
unit-grafana-0: 12:18:37 INFO juju.worker.uniter.operation ran "ingress-relation-created" hook (via hook dispatching script: dispatch)
unit-grafana-0: 12:18:39 INFO juju.worker.uniter.operation ran "grafana-relation-created" hook (via hook dispatching script: dispatch)
unit-grafana-0: 12:18:40 INFO juju.worker.uniter.operation ran "grafana-source-relation-created" hook (via hook dispatching script: dispatch)
unit-grafana-0: 12:18:41 INFO juju.worker.caasunitterminationworker terminating due to SIGTERM
unit-grafana-0: 12:18:42 ERROR juju.worker.uniter.operation hook "grafana-dashboard-relation-created" (via hook dispatching script: dispatch) failed: signal: terminated
unit-grafana-0: 12:18:42 INFO juju.worker.uniter awaiting error resolution for "relation-created" hook
unit-grafana-0: 12:18:42 INFO juju.worker.uniter awaiting error resolution for "relation-created" hook

Additional context

Crashdumps and config can be found here: https://oil-jenkins.canonical.com/artifacts/f12b5292-2b3e-4d63-a466-f5c2c4366f57/index.html

rbarry82 commented 2 years ago

Let me see when the latest release went to edge, because this is almost exactly the same as what was fixed in this PR. The short answer is that this is a race, and probably the same one traefik/0 hit on metrics-endpoint-relation-joined. When a bundle is deployed, the KubernetesServicePatch may patch, or delete and recreate, the Kubernetes service during the execution of another hook, and that hook subsequently fails: not with any error from the charm code, but because the pod was terminated while the hook was still running. @sed-i -- for awareness. I'll see if I can get a reasonable patch for the lib itself.
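
For context on the fix direction, the sketch below illustrates the general idea of making the service patch idempotent: compare the live Service against the desired spec and skip the patch when nothing has changed, so repeated hook runs do not churn the service and terminate in-flight hooks. This is not the observability-libs implementation; it is a minimal sketch using the standard kubernetes Python client, and the helper name and port-dict format are assumptions for illustration.

# Not the actual observability-libs code: a minimal sketch of an
# "only patch when needed" guard, using the standard kubernetes Python client.
# ensure_service_ports and the port-dict format are illustrative assumptions.
from kubernetes import client, config

def ensure_service_ports(name, namespace, desired_ports):
    # Assumes the charm container runs inside the cluster.
    config.load_incluster_config()
    api = client.CoreV1Api()

    # Read the live Service and normalize its ports for comparison.
    svc = api.read_namespaced_service(name, namespace)
    current = [
        {"name": p.name, "port": p.port, "targetPort": p.target_port}
        for p in (svc.spec.ports or [])
    ]

    if current == desired_ports:
        # Already patched: skip the update so the service is not recreated
        # (and the pod not terminated) while another hook is running.
        return

    api.patch_namespaced_service(
        name, namespace, body={"spec": {"ports": desired_ports}}
    )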