Grafana stuck executing after "Exec 19: timeout waiting for websocket connections: context deadline exceeded"

asbalderson commented 10 months ago

Bug Description

On a deployment of COS stable with a 3 node microk8s and microceph back-end. The grafana unit stayed stuck executing for over 4 hours, bumping into our timeout for hanging. The unit keeps executing config changed events for the whole duration, but never resolves if it is executing or not. The grafana logs show that it still runs a handful of commands but then stops with a Exec 19: timeout waiting for websocket connections: context deadline exceeded and never runs anything else over the 4 hour period.

Nothing else stands out in the bugs description.

To Reproduce

juju deploy cos-lite --overlay small-storage-overlay.yaml --overlay tls-overlay.yaml --overlay testing-overlay.yaml

# Storage overlay
#
# This sample overlay shows how to configure the amount of
# storage to grant to the various COS components.
applications:
  alertmanager:
    storage:
      # The Alertmanager operator uses the `database` storage to
      # store nflog and silence snapshots.
      # See the `--storage.path` argument of Alertmanager
      data: 2G
  grafana:
    storage:
      # The Grafana operator uses the `database` storage to store
      # configurations, plugins, user data, etc. That is, everything
      # that is normally stored under /var/lib/grafana.
      database: 2G
  loki:
    storage:
      # This is the volume used by loki for the index of boltdb-shipper.
      # See https://grafana.com/docs/loki/latest/configuration/examples/
      active-index-directory: 2G
      # Loki will store in this volume the chunks, which contain the
      # actual log content.
      # See https://grafana.com/docs/loki/latest/operations/storage/
      loki-chunks: 10G
  prometheus:
    storage:
      # Prometheus operator uses the `database` storage to store TSDB data.
      # See the `--storage.tsdb.path` argument of Prometheus.
      database: 10G

# TLS overlay
applications:
  ca:
    charm: self-signed-certificates
    channel: edge
    scale: 1
    options:
      ca-common-name: ca.demo.local
  external-ca:
    # This charm needs to be replaced with a real CA charm.
    # Use `juju refresh --switch` to replace via a "crossgrade refresh".
    charm: self-signed-certificates
    channel: edge
    scale: 1
    options:
      ca-common-name: external-ca.example.com

relations:
 # This is a more general CA (e.g. root CA) that signs traefik's own CSR.
 - [external-ca, traefik:certificates]

  # This is the local CA that signs CSRs from COS charms (excluding traefik).
  # Traefik is trusting this CA so that it could load balance via TLS.
 - [ca, traefik:receive-ca-cert]

 - [ca, alertmanager:certificates]
 - [ca, prometheus:certificates]
 - [ca, grafana:certificates]
 - [ca, loki:certificates]
 - [ca, catalogue:certificates]

# Testing Overlay
applications:
  alertmanager:
    scale: 2
    # Set low resource requests to accommodate small VMs
    options:
      cpu: 1m
      memory: 1Mi
  prometheus:
    # Set low resource requests to accommodate small VMs
    options:
      cpu: 1m
      memory: 1Mi
  grafana:
    # Set low resource requests to accommodate small VMs
    options:
      cpu: 1m
      memory: 1Mi
  loki:
    # Set low resource requests to accommodate small VMs
    options:
      cpu: 1m
      memory: 1Mi
  avalanche:
    charm: avalanche-k8s
    channel: edge
    scale: 2
    trust: true
    options:
      metric_count: 10
      series_count: 2

relations:
  - [avalanche:metrics-endpoint, prometheus]

Environment

Running on: microk8s - v1.28.3 microceph (snap) - latest/edge cos-lite - latest/stable:11 metallb - v0.13.10

Relevant log output

from grafana logs:
2024-01-16T03:37:32.044959467Z stderr F 2024-01-16T03:37:32.044Z [pebble] GET /v1/changes/16/wait?timeout=4.000s 1.0201003s 200
2024-01-16T03:37:32.110507133Z stdout F 2024-01-16T03:37:32.110Z [grafana] logger=migrator t=2024-01-16T03:37:32.110169145Z level=info msg="Executing migration" id="alter alert_rule table data column to mediumtext in mysql"
2024-01-16T03:37:32.118688051Z stderr F 2024-01-16T03:37:32.117Z [pebble] POST /v1/files 53.25769ms 200
2024-01-16T03:37:32.2238537Z stderr F 2024-01-16T03:37:32.222Z [pebble] POST /v1/exec 100.911324ms 202
2024-01-16T03:37:32.228219063Z stderr F 2024-01-16T03:37:32.226Z [pebble] GET /v1/tasks/17/websocket/control 512.422µs 200
2024-01-16T03:37:32.228252585Z stderr F 2024-01-16T03:37:32.228Z [pebble] GET /v1/tasks/17/websocket/stdio 58.318µs 200
2024-01-16T03:37:32.2308767Z stderr F 2024-01-16T03:37:32.229Z [pebble] GET /v1/tasks/17/websocket/stderr 56.178µs 200
2024-01-16T03:37:32.275297745Z stdout F 2024-01-16T03:37:32.274Z [grafana] logger=migrator t=2024-01-16T03:37:32.266897897Z level=info msg="Executing migration" id="add column for to alert_rule"
2024-01-16T03:37:32.298035757Z stderr F 2024-01-16T03:37:32.297Z [pebble] GET /v1/changes/17/wait?timeout=4.000s 67.806172ms 200
2024-01-16T03:37:32.362580189Z stdout F 2024-01-16T03:37:32.362Z [grafana] logger=migrator t=2024-01-16T03:37:32.362289259Z level=info msg="Executing migration" id="add column annotations to alert_rule"
2024-01-16T03:37:32.510454928Z stdout F 2024-01-16T03:37:32.510Z [grafana] logger=migrator t=2024-01-16T03:37:32.510290484Z level=info msg="Executing migration" id="add column labels to alert_rule"
2024-01-16T03:37:32.515924111Z stderr F 2024-01-16T03:37:32.515Z [pebble] GET /v1/files?action=read&path=%2Fetc%2Fgrafana%2Fgrafana-config.ini 97.751µs 200
2024-01-16T03:37:32.519063318Z stderr F 2024-01-16T03:37:32.518Z [pebble] GET /v1/plan?format=yaml 135.037µs 200
2024-01-16T03:37:32.522480764Z stderr F 2024-01-16T03:37:32.522Z [pebble] GET /v1/files?action=list&path=%2Fetc%2Fgrafana%2Fgrafana.crt&itself=true 138.467µs 200
2024-01-16T03:37:32.523503602Z stderr F 2024-01-16T03:37:32.523Z [pebble] GET /v1/plan?format=yaml 154.979µs 200
2024-01-16T03:37:32.612165021Z stdout F 2024-01-16T03:37:32.611Z [grafana] logger=migrator t=2024-01-16T03:37:32.610363226Z level=info msg="Executing migration" id="remove unique index from alert_rule on org_id, title columns"
2024-01-16T03:37:32.751029367Z stdout F 2024-01-16T03:37:32.750Z [grafana] logger=migrator t=2024-01-16T03:37:32.74940012Z level=info msg="Executing migration" id="add index in alert_rule on org_id, namespase_uid and title columns"
2024-01-16T03:37:32.930828644Z stdout F 2024-01-16T03:37:32.922Z [grafana] logger=migrator t=2024-01-16T03:37:32.922290584Z level=info msg="Executing migration" id="add dashboard_uid column to alert_rule"
2024-01-16T03:37:33.083516937Z stdout F 2024-01-16T03:37:33.083Z [grafana] logger=migrator t=2024-01-16T03:37:33.083346919Z level=info msg="Executing migration" id="add panel_id column to alert_rule"
2024-01-16T03:37:33.274148687Z stdout F 2024-01-16T03:37:33.274Z [grafana] logger=migrator t=2024-01-16T03:37:33.273938776Z level=info msg="Executing migration" id="add index in alert_rule on org_id, dashboard_uid and panel_id columns"
2024-01-16T03:37:33.389053587Z stderr F 2024-01-16T03:37:33.388Z [pebble] POST /v1/files 15.649629ms 200
2024-01-16T03:37:33.400407947Z stderr F 2024-01-16T03:37:33.396Z [pebble] GET /v1/files?action=list&path=%2Fetc%2Fgrafana%2Fgrafana.crt&itself=true 91.165µs 200
2024-01-16T03:37:33.402971885Z stderr F 2024-01-16T03:37:33.402Z [pebble] GET /v1/plan?format=yaml 178µs 200
2024-01-16T03:37:33.413484984Z stderr F 2024-01-16T03:37:33.413Z [pebble] POST /v1/layers 206.291µs 200
2024-01-16T03:37:33.416391526Z stderr F 2024-01-16T03:37:33.415Z [pebble] GET /v1/services?names=grafana 48.368µs 200
2024-01-16T03:37:33.424911275Z stdout F 2024-01-16T03:37:33.423Z [grafana] logger=migrator t=2024-01-16T03:37:33.423545041Z level=info msg="Executing migration" id="add rule_group_idx column to alert_rule"
2024-01-16T03:37:33.429173723Z stderr F 2024-01-16T03:37:33.428Z [pebble] POST /v1/services 10.490339ms 202
2024-01-16T03:37:33.480006617Z stderr F 2024-01-16T03:37:33.479Z [pebble] Service "grafana" stopped
2024-01-16T03:37:33.487917974Z stderr F 2024-01-16T03:37:33.487Z [pebble] GET /v1/changes/18/wait?timeout=4.000s 57.393551ms 200
2024-01-16T03:37:33.494897995Z stderr F 2024-01-16T03:37:33.493Z [pebble] POST /v1/files 4.487638ms 200
2024-01-16T03:37:33.504232314Z stderr F 2024-01-16T03:37:33.503Z [pebble] POST /v1/files 9.445754ms 200
2024-01-16T03:37:33.510299309Z stderr F 2024-01-16T03:37:33.510Z [pebble] POST /v1/files 5.822565ms 200
2024-01-16T03:37:33.524761196Z stderr F 2024-01-16T03:37:33.523Z [pebble] POST /v1/exec 8.118187ms 202
2024-01-16T03:37:38.542857354Z stderr F 2024-01-16T03:37:38.541Z [pebble] Exec 19: timeout waiting for websocket connections: context deadline exceeded