While testing the alerts for the hardware-observer charm, I'm not able to see any of them firing, even when the exported metric reports the failure value.
Example
For the redfish_call_success metric, I provided wrong credentials, and querying the exporter shows that the metric has value 0:
ubuntu@bomberto:~$ curl localhost:10000
(...)
# HELP redfish_service_available Indicates if redfish service is available or not on the system.
# TYPE redfish_service_available gauge
redfish_service_available 1.0
# HELP redfish_call_success Indicates if call to the redfish API succeeded or not.
# TYPE redfish_call_success gauge
redfish_call_success 0.0
This is reflected in COS Prometheus as well. But the alert rule attached to the failure of this metric never fires.
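For reference, the same value can be cross-checked straight from the Prometheus HTTP API (the address below is a placeholder for the COS Prometheus unit):

curl -s 'http://<prometheus-address>:9090/api/v1/query?query=redfish_call_success'
# expect a result entry whose value is "0" for the affected unit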
After troubleshooting this for a bit, I found that an extra juju_charm label is being injected into the alert rule, and that label is not present on the metric itself. Since the rule's label matchers can then never match the exported series, none of the alerts that are supposed to fire get triggered.
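To illustrate the mismatch (label sets shortened and illustrative, not copied verbatim from my deployment): the injected rule expression carries a juju_charm matcher that the stored series does not have, so the expression can never select the series:

# alert rule expression after label injection (illustrative):
redfish_call_success{juju_model="hw", juju_application="hardware-observer", juju_charm="hardware-observer"} == 0
# series actually stored in Prometheus (no juju_charm label):
redfish_call_success{juju_model="hw", juju_application="hardware-observer"}  0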
To Reproduce

Setup of principal and subordinate charms: deploy ubuntu machines with the hardware-observer and grafana-agent subordinates (see the status output under Environment).

COS setup on microk8s:
juju deploy microk8s
juju config microk8s addons="dns ingress hostpath-storage metallb:10.245.130.50-10.245.130.50"
# add microk8s cloud to controller
juju add-k8s micro -c my-ctrl
# add new model to cloud
juju add-model cos micro
juju deploy cos-lite --channel edge --trust
juju offer prometheus:receive-remote-write

Setting up the CMR:
juju relate grafana-agent micro:cos.prometheus
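With the CMR in place, the rule as Prometheus actually evaluates it (including the injected labels) can be inspected via the rules API; the address is again a placeholder, and the jq filter is just one convenient way to slice the output:

curl -s 'http://<prometheus-address>:9090/api/v1/rules' \
  | jq '.data.groups[].rules[] | {name: .name, query: .query}'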
Environment
Running Juju on a MAAS cloud backend; all applications were deployed from latest/stable.
❯ juju --version
2.9.44-ubuntu-amd64
❯ juju status
Model  Controller    Cloud/Region     Version  SLA          Timestamp
hw     ct-maas-ctrl  ct-maas/default  2.9.43   unsupported  14:36:35+05:30

SAAS        Status  Store         URL
prometheus  active  ct-maas-ctrl  ashley/cos.prometheus

App                Version  Status  Scale  Charm              Channel        Rev  Exposed  Message
grafana-agent               active      6  grafana-agent                       4  no       logging-consumer: off, grafana-cloud-config: off
hardware-observer           error       6  hardware-observer                  26  no       hook failed: "upgrade-charm"
microk8s                    active      1  microk8s           legacy/stable  101  no
ubuntu                      active      6  ubuntu             latest/stable   24  no

Unit                     Workload  Agent  Machine  Public address  Ports                     Message
microk8s/0*              active    idle   13       10.1.10.204     80/tcp,443/tcp,16443/tcp
ubuntu/3*                active    idle   3        10.1.11.163
  grafana-agent/41*      active    idle            10.1.11.163                               grafana-cloud-config: off, logging-consumer: off
  hardware-observer/59   error     idle            10.1.11.163                               hook failed: "upgrade-charm"
ubuntu/4                 active    idle   4        10.1.11.46
  grafana-agent/42       active    idle            10.1.11.46                                logging-consumer: off, grafana-cloud-config: off
  hardware-observer/58*  active    idle            10.1.11.46                                Unit is ready
ubuntu/5                 active    idle   5        10.245.130.6
  grafana-agent/44       active    idle            10.245.130.6                              grafana-cloud-config: off, logging-consumer: off
  hardware-observer/60   blocked   idle            10.245.130.6                              Missing resources: ['sas2ircu-bin']
ubuntu/7                 active    idle   7        10.1.11.55
  grafana-agent/40       active    idle            10.1.11.55                                grafana-cloud-config: off, logging-consumer: off
  hardware-observer/56   active    idle            10.1.11.55                                Unit is ready
ubuntu/10                active    idle   10       10.1.10.226
  grafana-agent/43       active    idle            10.1.10.226                               logging-consumer: off, grafana-cloud-config: off
  hardware-observer/61   blocked   idle            10.1.10.226                               Missing resources: ['sas2ircu-bin']
ubuntu/12                active    idle   12       10.1.25.86
  grafana-agent/39       active    idle            10.1.25.86                                logging-consumer: off, grafana-cloud-config: off
  hardware-observer/57   blocked   idle            10.1.25.86                                Missing resources: ['storcli-deb']

Machine  State    Address       Inst id   Series  AZ        Message
3        started  10.1.11.163   coinfish  jammy   default   Deployed
4        started  10.1.11.46    bomberto  jammy   default   Deployed
5        started  10.245.130.6  gurley    jammy   Cert Lab  Deployed
7        started  10.1.11.55    rozary    jammy   default   Deployed
10       started  10.1.10.226   prunus    jammy   Cert Lab  Deployed
12       started  10.1.25.86    kongfu    jammy   Cert Lab  Deployed
13       started  10.1.10.204   birdo     jammy   default   Deployed
Relevant log output
Output of running `juju show-unit grafana-agent/x`: https://pastebin.ubuntu.com/p/nDTjPnBx6W/
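The injected labels can also be seen in the relation data itself. A rough way to pull them out, assuming the rules travel in an alert_rules key in the application data (the exact key layout may differ between charm library versions):

juju show-unit grafana-agent/40 --format json \
  | jq -r '.. | .alert_rules? // empty' \
  | jq '.groups[].rules[] | {alert: .alert, expr: .expr}'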