canonical / grafana-agent-k8s-operator

This charmed operator automates the operational procedures of running Grafana Agent, an open-source telemetry collector.
https://charmhub.io/grafana-agent-k8s
Apache License 2.0

Alert rules not firing due to extra "juju_charm" key in labels #237

Closed dashmage closed 1 year ago

dashmage commented 1 year ago

Bug Description

While testing the alerts for the hardware-observer charm, I can't get any of them to fire, even when the exported metric reports its failure value.

Example: for the redfish_call_success metric, I provided wrong credentials, and querying the exporter shows that the metric has the value 0.

ubuntu@bomberto:~$ curl localhost:10000

(...)

# HELP redfish_service_available Indicates if redfish service is available or not on the system.
# TYPE redfish_service_available gauge
redfish_service_available 1.0
# HELP redfish_call_success Indicates if call to the redfish API succeeded or not.
# TYPE redfish_call_success gauge
redfish_call_success 0.0
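
The exporter output above can also be checked programmatically. This is only a minimal sketch: `parse_samples` is a hypothetical helper, not part of any charm or exporter, and it assumes plain `name value` sample lines without timestamps, as in the excerpt.

```python
# Excerpt of the exporter output shown above.
EXPOSITION = """\
# HELP redfish_service_available Indicates if redfish service is available or not on the system.
# TYPE redfish_service_available gauge
redfish_service_available 1.0
# HELP redfish_call_success Indicates if call to the redfish API succeeded or not.
# TYPE redfish_call_success gauge
redfish_call_success 0.0
"""


def parse_samples(text):
    """Map each sample name to its float value, skipping comments and blanks."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumes the value is the last space-separated field (no timestamp).
        name, value = line.rsplit(" ", 1)
        samples[name] = float(value)
    return samples


samples = parse_samples(EXPOSITION)
# Gauges reporting 0 are the ones an alert rule would be expected to catch.
failing = [name for name, value in samples.items() if value == 0.0]
print(failing)  # ['redfish_call_success']
```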

This is reflected in COS Prometheus as well. [image]

But the alert rule attached to the failure of this metric does not fire. [image]

After some troubleshooting, I found that the alert rule is injected with an extra juju_charm key that is not present on the metric itself. Because of this extra key, none of the alerts that should fire are triggered.
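
To illustrate why one extra key is enough to silence a rule, here is a small sketch of equality label matching. It is not Prometheus code; it only imitates the documented behaviour that a label missing from a series is treated as an empty string. The series labels are taken from the scrape config further down in this issue, while the juju_charm value in the rule is an assumption for illustration.

```python
def matches(series_labels, matchers):
    """Return True if every equality matcher is satisfied by the series.

    A label that is absent from the series compares as the empty string.
    """
    return all(
        series_labels.get(name, "") == value for name, value in matchers.items()
    )


# Labels on the scraped series (from the hardware-observer_0 scrape job).
series = {
    "__name__": "redfish_call_success",
    "juju_model": "hw",
    "juju_model_uuid": "dfaf0254-3e9c-4684-8575-5b86266f1581",
    "juju_application": "hardware-observer",
    "juju_unit": "hardware-observer/58",
}

# Matchers a rule would carry without the extra key.
rule_without_charm = {
    "__name__": "redfish_call_success",
    "juju_model": "hw",
    "juju_model_uuid": "dfaf0254-3e9c-4684-8575-5b86266f1581",
    "juju_application": "hardware-observer",
}

# Same rule with the injected key; the value here is hypothetical.
rule_with_charm = dict(rule_without_charm, juju_charm="hardware-observer")

print(matches(series, rule_without_charm))  # True
print(matches(series, rule_with_charm))     # False
```

With the extra matcher, the selector returns no series, so the alert expression never has anything to evaluate.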

To Reproduce

Setup of principal and subordinate charms

juju deploy ubuntu
juju deploy hardware-observer
juju deploy grafana-agent

juju relate ubuntu hardware-observer
juju relate ubuntu grafana-agent
juju relate hardware-observer grafana-agent

COS Setup on microk8s

juju deploy microk8s
juju config microk8s addons="dns ingress hostpath-storage metallb:10.245.130.50-10.245.130.50"

# add microk8s cloud to controller
juju add-k8s micro -c my-ctrl

# add new model to cloud
juju add-model cos micro
juju deploy cos-lite --channel edge --trust
juju offer prometheus:receive-remote-write

Setting up the CMR

juju relate grafana-agent micro:cos.prometheus

Environment

Running Juju on a MAAS cloud backend. All applications were deployed from latest/stable.

❯ juju --version
2.9.44-ubuntu-amd64
❯ juju status
Model  Controller    Cloud/Region     Version  SLA          Timestamp
hw     ct-maas-ctrl  ct-maas/default  2.9.43   unsupported  14:36:35+05:30

SAAS        Status  Store         URL
prometheus  active  ct-maas-ctrl  ashley/cos.prometheus

App                Version  Status  Scale  Charm              Channel        Rev  Exposed  Message
grafana-agent               active      6  grafana-agent                       4  no       logging-consumer: off, grafana-cloud-config: off
hardware-observer           error       6  hardware-observer                  26  no       hook failed: "upgrade-charm"
microk8s                    active      1  microk8s           legacy/stable  101  no       
ubuntu                      active      6  ubuntu             latest/stable   24  no       

Unit                     Workload  Agent  Machine  Public address  Ports                     Message
microk8s/0*              active    idle   13       10.1.10.204     80/tcp,443/tcp,16443/tcp  
ubuntu/3*                active    idle   3        10.1.11.163                               
  grafana-agent/41*      active    idle            10.1.11.163                               grafana-cloud-config: off, logging-consumer: off
  hardware-observer/59   error     idle            10.1.11.163                               hook failed: "upgrade-charm"
ubuntu/4                 active    idle   4        10.1.11.46                                
  grafana-agent/42       active    idle            10.1.11.46                                logging-consumer: off, grafana-cloud-config: off
  hardware-observer/58*  active    idle            10.1.11.46                                Unit is ready
ubuntu/5                 active    idle   5        10.245.130.6                              
  grafana-agent/44       active    idle            10.245.130.6                              grafana-cloud-config: off, logging-consumer: off
  hardware-observer/60   blocked   idle            10.245.130.6                              Missing resources: ['sas2ircu-bin']
ubuntu/7                 active    idle   7        10.1.11.55                                
  grafana-agent/40       active    idle            10.1.11.55                                grafana-cloud-config: off, logging-consumer: off
  hardware-observer/56   active    idle            10.1.11.55                                Unit is ready
ubuntu/10                active    idle   10       10.1.10.226                               
  grafana-agent/43       active    idle            10.1.10.226                               logging-consumer: off, grafana-cloud-config: off
  hardware-observer/61   blocked   idle            10.1.10.226                               Missing resources: ['sas2ircu-bin']
ubuntu/12                active    idle   12       10.1.25.86                                
  grafana-agent/39       active    idle            10.1.25.86                                logging-consumer: off, grafana-cloud-config: off
  hardware-observer/57   blocked   idle            10.1.25.86                                Missing resources: ['storcli-deb']

Machine  State    Address       Inst id   Series  AZ        Message
3        started  10.1.11.163   coinfish  jammy   default   Deployed
4        started  10.1.11.46    bomberto  jammy   default   Deployed
5        started  10.245.130.6  gurley    jammy   Cert Lab  Deployed
7        started  10.1.11.55    rozary    jammy   default   Deployed
10       started  10.1.10.226   prunus    jammy   Cert Lab  Deployed
12       started  10.1.25.86    kongfu    jammy   Cert Lab  Deployed
13       started  10.1.10.204   birdo     jammy   default   Deployed

Relevant log output

Output of running `juju show-unit grafana-agent/x`: https://pastebin.ubuntu.com/p/nDTjPnBx6W/

Additional context

ubuntu@bomberto:~$ cat /etc/grafana-agent.yaml 
integrations:
  agent:
    enabled: true
    relabel_configs:
    - regex: (.*)
      replacement: juju_hw_dfaf0254-3e9c-4684-8575-5b86266f1581_grafana-agent_self-monitoring
      target_label: job
    - regex: (.*)
      replacement: hw_dfaf0254-3e9c-4684-8575-5b86266f1581_hardware-observer_hardware-observer/58
      target_label: instance
    - replacement: grafana-agent
      source_labels:
      - __address__
      target_label: juju_charm
    - replacement: hw
      source_labels:
      - __address__
      target_label: juju_model
    - replacement: dfaf0254-3e9c-4684-8575-5b86266f1581
      source_labels:
      - __address__
      target_label: juju_model_uuid
    - replacement: grafana-agent
      source_labels:
      - __address__
      target_label: juju_application
    - replacement: grafana-agent/42
      source_labels:
      - __address__
      target_label: juju_unit
  node_exporter:
    enable_collectors:
    - logind
    - systemd
    - mountstats
    - processes
    - sysctl
    enabled: true
    relabel_configs:
    - regex: (.*)
      replacement: juju_hw_dfaf0254-3e9c-4684-8575-5b86266f1581_grafana-agent_node-exporter
      target_label: job
    - regex: (.*)
      replacement: hw_dfaf0254-3e9c-4684-8575-5b86266f1581_hardware-observer_hardware-observer/58
      target_label: instance
    - replacement: hw
      source_labels:
      - __address__
      target_label: juju_model
    - replacement: dfaf0254-3e9c-4684-8575-5b86266f1581
      source_labels:
      - __address__
      target_label: juju_model_uuid
    - replacement: hardware-observer
      source_labels:
      - __address__
      target_label: juju_application
    - replacement: hardware-observer/58
      source_labels:
      - __address__
      target_label: juju_unit
    sysctl_include:
    - net.ipv4.neigh.default.gc_thresh3
  prometheus_remote_write:
  - tls_config:
      insecure_skip_verify: false
    url: http://10.245.130.50/cos-prometheus-0/api/v1/write
logs:
  configs:
  - clients: []
    name: log_file_scraper
    scrape_configs:
    - job_name: varlog
      pipeline_stages:
      - drop:
          expression: .*file is a directory.*
      static_configs:
      - labels:
          __path__: /var/log/*log
          instance: hw_dfaf0254-3e9c-4684-8575-5b86266f1581_hardware-observer_hardware-observer/58
          juju_application: hardware-observer
          juju_model: hw
          juju_model_uuid: dfaf0254-3e9c-4684-8575-5b86266f1581
          juju_unit: hardware-observer/58
        targets:
        - localhost
    - job_name: syslog
      journal:
        labels:
          instance: hw_dfaf0254-3e9c-4684-8575-5b86266f1581_hardware-observer_hardware-observer/58
          juju_application: hardware-observer
          juju_model: hw
          juju_model_uuid: dfaf0254-3e9c-4684-8575-5b86266f1581
          juju_unit: hardware-observer/58
      pipeline_stages:
      - drop:
          expression: .*file is a directory.*
  positions_directory: ${SNAP_DATA}/grafana-agent-positions
metrics:
  configs:
  - name: agent_scraper
    remote_write:
    - tls_config:
        insecure_skip_verify: false
      url: http://10.245.130.50/cos-prometheus-0/api/v1/write
    scrape_configs:
    - job_name: hardware-observer_0
      metrics_path: /metrics
      static_configs:
      - labels:
          instance: hw_dfaf0254-3e9c-4684-8575-5b86266f1581_hardware-observer_hardware-observer/58
          juju_application: hardware-observer
          juju_model: hw
          juju_model_uuid: dfaf0254-3e9c-4684-8575-5b86266f1581
          juju_unit: hardware-observer/58
        targets:
        - localhost:10000
  wal_directory: /tmp/agent/data
server:
  log_level: info
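
In the config above, only the agent's self-monitoring integration gets a juju_charm label; the node_exporter integration and the hardware-observer_0 scrape job do not. A rough sketch of that comparison follows. The label sets are transcribed by hand from the dump (juju_* labels only), and the dictionary keys are hypothetical names chosen for this example.

```python
# juju_* labels attached per job, copied by hand from /etc/grafana-agent.yaml.
JOB_LABELS = {
    "integrations/agent": {
        "juju_charm",
        "juju_model",
        "juju_model_uuid",
        "juju_application",
        "juju_unit",
    },
    "integrations/node_exporter": {
        "juju_model",
        "juju_model_uuid",
        "juju_application",
        "juju_unit",
    },
    "hardware-observer_0": {
        "juju_model",
        "juju_model_uuid",
        "juju_application",
        "juju_unit",
    },
}


def jobs_missing(label, job_labels):
    """List the jobs whose label set does not include the given label."""
    return sorted(job for job, labels in job_labels.items() if label not in labels)


print(jobs_missing("juju_charm", JOB_LABELS))
# ['hardware-observer_0', 'integrations/node_exporter']
```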

dashmage commented 1 year ago

Related to existing issue #190

simskij commented 1 year ago

Closing this as a duplicate of #190