canonical / loki-k8s-operator

https://charmhub.io/loki-k8s
Apache License 2.0

Charm stuck in the blocked state after enabling TLS for Traefik #338

Closed natalytvinova closed 9 months ago

natalytvinova commented 9 months ago

Bug Description

Loki from latest/stable goes into the blocked state with the message "Errors in alert rule groups. Check juju debug-log" after the TLS overlay is enabled. Without the overlay, the charm does not enter this state.

To Reproduce

  1. Deploy the COS bundle with Juju, using the TLS and storage overlays. The bundles are in the comments below because, for some reason, I can't attach .yaml files.
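The reproduction step can be sketched as a small helper that assembles the `juju deploy` invocation. This is only an illustration: the file names are placeholders (the issue does not say what the bundle and overlay files are called), and the helper `juju_deploy_cmd` is hypothetical. It relies on `juju deploy` accepting a repeated `--overlay` flag and `--trust`.

```python
import shlex
from typing import List, Sequence


def juju_deploy_cmd(bundle: str, overlays: Sequence[str], trust: bool = True) -> List[str]:
    """Build the argument list for deploying a bundle with overlays."""
    cmd = ["juju", "deploy", bundle]
    for overlay in overlays:
        # Each overlay file is passed with its own --overlay flag.
        cmd += ["--overlay", overlay]
    if trust:
        # The bundle's applications set `trust: true`, so consent is given here.
        cmd.append("--trust")
    return cmd


# Placeholder file names; substitute the files saved from the comments.
cmd = juju_deploy_cmd(
    "./cos-bundle.yaml",
    ["./offers-overlay.yaml", "./tls-overlay.yaml", "./options-overlay.yaml"],
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` on a machine with the Juju client.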

Environment

Juju 3.1.7 runs locally on 3 infra nodes. Loki is on latest/stable, rev 105. MicroK8s: channel 1.28/stable, charm latest/stable.

Relevant log output

Here is a juju-status-log:

Time                   Type       Status     Message
25 Jan 2024 16:20:21Z  juju-unit  executing  running logging-relation-changed hook for remote-7ca9e187477947fe88fc3d08252419e5/13
25 Jan 2024 16:20:23Z  juju-unit  executing  running logging-relation-changed hook
25 Jan 2024 16:20:26Z  juju-unit  executing  running logging-relation-joined hook for remote-7ca9e187477947fe88fc3d08252419e5/15
25 Jan 2024 16:20:28Z  juju-unit  executing  running logging-relation-changed hook for remote-7ca9e187477947fe88fc3d08252419e5/15
25 Jan 2024 16:20:31Z  juju-unit  executing  running logging-relation-joined hook for remote-7ca9e187477947fe88fc3d08252419e5/18
25 Jan 2024 16:20:33Z  juju-unit  executing  running logging-relation-changed hook for remote-7ca9e187477947fe88fc3d08252419e5/18
25 Jan 2024 16:20:36Z  juju-unit  executing  running logging-relation-joined hook for remote-7ca9e187477947fe88fc3d08252419e5/20
25 Jan 2024 16:20:38Z  juju-unit  executing  running logging-relation-changed hook for remote-7ca9e187477947fe88fc3d08252419e5/20
25 Jan 2024 16:20:41Z  juju-unit  executing  running logging-relation-joined hook for remote-7ca9e187477947fe88fc3d08252419e5/23
25 Jan 2024 16:20:43Z  juju-unit  executing  running logging-relation-changed hook for remote-7ca9e187477947fe88fc3d08252419e5/23
25 Jan 2024 16:20:46Z  juju-unit  executing  running logging-relation-joined hook for remote-7ca9e187477947fe88fc3d08252419e5/25
25 Jan 2024 16:20:48Z  juju-unit  executing  running logging-relation-changed hook for remote-7ca9e187477947fe88fc3d08252419e5/25
25 Jan 2024 16:20:50Z  juju-unit  executing  running logging-relation-joined hook for remote-7ca9e187477947fe88fc3d08252419e5/26
25 Jan 2024 16:20:53Z  juju-unit  executing  running logging-relation-changed hook for remote-7ca9e187477947fe88fc3d08252419e5/26
25 Jan 2024 16:20:55Z  juju-unit  executing  running logging-relation-joined hook for remote-7ca9e187477947fe88fc3d08252419e5/27
25 Jan 2024 16:20:57Z  juju-unit  executing  running logging-relation-changed hook for remote-7ca9e187477947fe88fc3d08252419e5/27
25 Jan 2024 16:21:00Z  juju-unit  executing  running logging-relation-joined hook for remote-7ca9e187477947fe88fc3d08252419e5/28
25 Jan 2024 16:21:03Z  juju-unit  executing  running logging-relation-changed hook for remote-7ca9e187477947fe88fc3d08252419e5/28
25 Jan 2024 16:21:04Z  workload   blocked    Errors in alert rule groups. Check juju debug-log
25 Jan 2024 16:21:05Z  juju-unit  idle 
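In a long status log, the transition into the blocked state is easy to miss among the hook entries. A minimal sketch for locating it, assuming the four-column layout shown above (columns separated by two or more spaces); the names `StatusEntry`, `parse_status_log` and `first_blocked` are made up for this example:

```python
import re
from typing import List, NamedTuple, Optional


class StatusEntry(NamedTuple):
    time: str
    kind: str      # "juju-unit" or "workload"
    status: str
    message: str


def parse_status_log(text: str) -> List[StatusEntry]:
    """Split each status-log row into its four columns."""
    entries = []
    for line in text.splitlines():
        # Columns are separated by runs of 2+ spaces; the message keeps its single spaces.
        fields = re.split(r"\s{2,}", line.strip(), maxsplit=3)
        if len(fields) < 3 or fields[0] == "Time":
            continue  # blank line, header row, or an unexpected shape
        message = fields[3] if len(fields) > 3 else ""
        entries.append(StatusEntry(fields[0], fields[1], fields[2], message))
    return entries


def first_blocked(entries: List[StatusEntry]) -> Optional[StatusEntry]:
    """Return the first workload entry whose status is 'blocked', if any."""
    for entry in entries:
        if entry.kind == "workload" and entry.status == "blocked":
            return entry
    return None


SAMPLE = """\
Time                   Type       Status     Message
25 Jan 2024 16:21:03Z  juju-unit  executing  running logging-relation-changed hook for remote-7ca9e187477947fe88fc3d08252419e5/28
25 Jan 2024 16:21:04Z  workload   blocked    Errors in alert rule groups. Check juju debug-log
25 Jan 2024 16:21:05Z  juju-unit  idle
"""

blocked = first_blocked(parse_status_log(SAMPLE))
print(blocked)
```

On the sample rows, the entry found is the one timestamped 16:21:04Z, right after the last logging-relation-changed hook.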

Here is the output of juju show-unit loki/0: show-unit-loki.log

Here are the alert rules from the charm that are in place: alert-rules.txt

przemeklal commented 9 months ago

I believe I'm hitting the same issue. juju debug-log shows:

unit-loki-0: 10:26:31 ERROR unit.loki/0.juju-log certificates:32: Checking alert rules: 400 - Bad Request

It looks like the alert rules aren't even validated and the charm code fails to obtain them.
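One way to check by hand whether the rules request itself is rejected is to query Loki's rules API directly, once over plain HTTP and once over HTTPS. The sketch below assumes the charm's check amounts to a GET on `/loki/api/v1/rules`, which this thread does not confirm. The helper names, the base URL and the CA file path are placeholders.

```python
import ssl
import urllib.error
import urllib.request
from typing import Optional, Tuple


def rules_url(base_url: str) -> str:
    """Append Loki's rules API path to a base URL."""
    return base_url.rstrip("/") + "/loki/api/v1/rules"


def probe_rules(base_url: str, ca_file: Optional[str] = None,
                timeout: float = 5.0) -> Tuple[int, str]:
    """GET the rules endpoint and return (HTTP status, response body)."""
    url = rules_url(base_url)
    context = None
    if url.startswith("https://"):
        # Trust the given CA bundle (for example, the CA that signed Loki's certificate).
        context = ssl.create_default_context(cafile=ca_file)
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; the body often explains the rejection.
        return err.code, err.read().decode("utf-8", "replace")


if __name__ == "__main__":
    # Placeholder address and CA path; adjust to the unit being debugged.
    print(probe_rules("http://loki-0.loki-endpoints.cos.svc.cluster.local:3100"))
    print(probe_rules("https://loki-0.loki-endpoints.cos.svc.cluster.local:3100",
                      ca_file="/path/to/ca.crt"))
```

If the plain-HTTP probe returns 400 while the HTTPS probe succeeds, that would suggest the request scheme rather than the rule content is at fault. The thread itself does not establish this.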

I also use the same overlay https://github.com/canonical/cos-lite-bundle/blob/main/overlays/tls-overlay.yaml and I hit the issue after relating grafana-agents from other models to Loki.

juju debug-log output: loki_debug_log.log

natalytvinova commented 9 months ago

COS bundle:

---
bundle: kubernetes
applications:
  traefik:
    charm: /home/ubuntu/deployment/example/charms/cos/traefik-k8s_r129.charm
    scale: 1
    trust: true
    channel: null
    resources:
      traefik-image: ghcr.io/canonical/traefik:2.10.4  
  alertmanager:
    charm: /home/ubuntu/deployment/example/charms/cos/alertmanager-k8s_r77.charm    
    scale: 1
    trust: true
    channel: null
    resources:
      alertmanager-image: docker.io/ubuntu/prometheus-alertmanager:latest
  prometheus:
    charm: /home/ubuntu/deployment/example/charms/cos/prometheus-k8s_r129.charm
    scale: 1
    trust: true
    channel: null
    resources:
      prometheus-image: ghcr.io/canonical/prometheus:2.46.0
    options:
      metrics_retention_time: 62d    
  grafana:
    charm: /home/ubuntu/deployment/example/charms/cos/grafana-k8s_r82.charm
    scale: 1
    trust: true
    channel: null
    resources:
      grafana-image: docker.io/ubuntu/grafana:latest
      litestream-image: docker.io/litestream/litestream:latest
  catalogue:
    charm: /home/ubuntu/deployment/example/charms/cos/catalogue-k8s_r19.charm
    scale: 1
    trust: true
    channel: null
    resources:
      catalogue-image: ghcr.io/canonical/catalogue-k8s-operator:latest
    options:
      title: Example prod2or Canonical Observability Stack
      tagline: Model-driven Observability Stack deployed with a single command.
      description: |
        Canonical Observability Stack Lite, or COS Lite, is a light-weight, highly-integrated,
        Juju-based observability suite running on Kubernetes.
  loki:
    charm: /home/ubuntu/deployment/example/charms/cos/loki-k8s_r91.charm
    scale: 1
    trust: true
    channel: null
    resources:
      loki-image: ghcr.io/canonical/loki:2.7.4

relations:
- [traefik:ingress-per-unit, prometheus:ingress]
- [traefik:ingress-per-unit, loki:ingress]
- [traefik:traefik-route, grafana:ingress]
- [traefik:ingress, alertmanager:ingress]
- [prometheus:alertmanager, alertmanager:alerting]
- [grafana:grafana-source, prometheus:grafana-source]
- [grafana:grafana-source, loki:grafana-source]
- [grafana:grafana-source, alertmanager:grafana-source]
- [loki:alertmanager, alertmanager:alerting]
# COS-monitoring
- [prometheus:metrics-endpoint, traefik:metrics-endpoint]
- [prometheus:metrics-endpoint, alertmanager:self-metrics-endpoint]
- [prometheus:metrics-endpoint, loki:metrics-endpoint]
- [prometheus:metrics-endpoint, grafana:metrics-endpoint]
- [grafana:grafana-dashboard, loki:grafana-dashboard]
- [grafana:grafana-dashboard, prometheus:grafana-dashboard]
- [grafana:grafana-dashboard, alertmanager:grafana-dashboard]
# Service Catalogue
- [catalogue:ingress, traefik:ingress]
- [catalogue:catalogue, grafana:catalogue]
- [catalogue:catalogue, prometheus:catalogue]
- [catalogue:catalogue, alertmanager:catalogue]

natalytvinova commented 9 months ago

Offers:

applications:
  alertmanager:
    offers:
      alertmanager:
        endpoints:
        - karma-dashboard
  grafana:
    offers:
      grafana:
        endpoints:
        - grafana-dashboard
  loki:
    offers:
      loki:
        endpoints:
        - logging
  prometheus:
    offers:
      prometheus:
        endpoints:
        - metrics-endpoint
        - receive-remote-write

natalytvinova commented 9 months ago

TLS:

applications:
  ca:
    charm: self-signed-certificates
    channel: edge
    scale: 1
    options:
      ca-common-name: traefik-0.traefik-endpoints.cos.svc.cluster.local
  external-ca:
    # This charm needs to be replaced with a real CA charm.
    # Use `juju refresh --switch` to replace via a "crossgrade refresh".
    charm: self-signed-certificates
    channel: edge
    scale: 1
    options:
      #ca-common-name: external-ca.example.com
      ca-common-name: traefik-0.traefik-endpoints.cos.svc.cluster.local

relations:
  # This is a more general CA (e.g. root CA) that signs traefik's own CSR.
  - [external-ca, traefik:certificates]

  # This is the local CA that signs CSRs from COS charms (excluding traefik).
  # Traefik is trusting this CA so that it could load balance via TLS.
  - [ca, traefik:receive-ca-cert]

  - [ca, alertmanager:certificates]
  - [ca, prometheus:certificates]
  - [ca, grafana:certificates]
  - [ca, loki:certificates]
  - [ca, catalogue:certificates]

natalytvinova commented 9 months ago

Options overlay:

bundle: kubernetes
applications:
  scrape-interval-config:
    channel: null
    charm: /home/ubuntu/deployment/example/charms/cos/prometheus-scrape-config-k8s_r39.charm
    scale: 1
    trust: true
    options:
      scrape_timeout: 30s
      scrape_interval: 5m
    offers:
      scrape-interval-config:
        endpoints:
          - configurable-scrape-jobs
relations:
  - [ scrape-interval-config:metrics-endpoint, prometheus:metrics-endpoint]

natalytvinova commented 9 months ago

COS Relations in Openstack:

- ['cos-grafana:grafana-dashboard', 'cos-proxy:downstream-grafana-dashboard']
- ['cos-loki:logging', 'cos-proxy:downstream-logging']
- ['cos-prometheus:metrics-endpoint', 'cos-proxy:downstream-prometheus-scrape']
- ['cos-proxy:dashboards', 'etcd:grafana']
- ['cos-proxy:dashboards', 'prometheus-grok-exporter:dashboards']
- ['cos-proxy:dashboards', 'prometheus-openstack-exporter:dashboards']
- ['cos-proxy:dashboards', 'telegraf:dashboards']
- ['cos-proxy:filebeat', 'filebeat:logstash']
- ['cos-proxy:juju-info', 'filebeat:beats-host']
- ['cos-proxy:juju-info', 'landscape-client:container']
- ['cos-proxy:juju-info', 'nrpe:general-info']
- ['cos-proxy:juju-info', 'prometheus-grok-exporter:juju-info']
- ['cos-proxy:juju-info', 'telegraf:juju-info']
- ['cos-proxy:juju-info', 'ubuntu-advantage:juju-info']
- ['cos-proxy:monitors', 'nrpe:monitors']
- ['cos-proxy:prometheus-rules', 'telegraf:prometheus-rules']
- ['cos-proxy:prometheus-target', 'telegraf:prometheus-client']

natalytvinova commented 8 months ago

Hi team, which channel and revision contains that fix? I'm using latest/stable right now and facing this issue

mmkay commented 8 months ago

The fix was in revision 117: https://github.com/canonical/loki-k8s-operator/releases/tag/rev117

It's currently in latest/candidate, latest/beta and latest/edge.

natalytvinova commented 8 months ago

Thanks @mmkay !