Closed (sed-i closed this issue 3 months ago)
Hi @sed-i
Yes, the nagios_host_context was added in charm-nrpe in order to let us remove it from the juju_unit label value. Alert rules need to be fixed.
I've deployed:
Model  Controller  Cloud/Region         Version  SLA          Timestamp
ubu    lxd         localhost/localhost  3.5.2.1  unsupported  13:55:51-03:00

SAAS        Status  Store     URL
prometheus  active  microk8s  admin/cos.prometheus

App        Version  Status  Scale  Charm      Channel        Rev  Exposed  Message
cos-proxy  n/a      active      1  cos-proxy  latest/edge    101  no
nrpe                active      1  nrpe       latest/edge    125  no       Ready
ubuntu     22.04    active      1  ubuntu     latest/stable   24  no

Unit          Workload  Agent  Machine  Public address  Ports          Message
cos-proxy/0*  active    idle   1        10.222.104.142
ubuntu/0*     active    idle   0        10.222.104.154
  nrpe/0*     active    idle            10.222.104.154  5666/tcp icmp  Ready

Machine  State    Address         Inst id        Base          AZ  Message
0        started  10.222.104.154  juju-db211a-0  ubuntu@22.04      Running
1        started  10.222.104.142  juju-db211a-1  ubuntu@22.04      Running

Integration provider                    Requirer                     Interface          Type         Message
cos-proxy:downstream-prometheus-scrape  prometheus:metrics-endpoint  prometheus_scrape  regular
nrpe:monitors                           cos-proxy:monitors           monitors           regular
ubuntu:juju-info                        nrpe:general-info            juju-info          subordinate
... cross-related to this:
Model  Controller  Cloud/Region        Version  SLA          Timestamp
cos    microk8s    microk8s/localhost  3.5.2    unsupported  13:56:46-03:00

App   Version  Status  Scale  Charm           Channel      Rev  Address         Exposed  Message
prom  2.52.0   active      1  prometheus-k8s  latest/edge  210  10.152.183.138  no

Unit     Workload  Agent  Address     Ports  Message
prom/0*  active    idle   10.1.9.215

Offer       Application  Charm           Rev  Connected  Endpoint          Interface          Role
prometheus  prom         prometheus-k8s  210  1/1        metrics-endpoint  prometheus_scrape  requirer

Integration provider   Requirer               Interface         Type  Message
prom:prometheus-peers  prom:prometheus-peers  prometheus_peers  peer
And all alert rules are inactive:
It seems the juju_unit labels in the alert rules are OK:
avg_over_time(command_status{command="check_conntrack",juju_unit="ubuntu/0"}[15m]) > 1 or (absent_over_time(command_status{command="check_conntrack",juju_unit="ubuntu/0"}[10m]) == 1) or (absent_over_time(up{juju_unit="ubuntu/0"}[10m]) == 1)
avg_over_time(command_status{command="check_systemd_scopes",juju_unit="ubuntu/0"}[15m]) > 1 or (absent_over_time(command_status{command="check_systemd_scopes",juju_unit="ubuntu/0"}[10m]) == 1) or (absent_over_time(up{juju_unit="ubuntu/0"}[10m]) == 1)
avg_over_time(command_status{command="check_reboot",juju_unit="ubuntu/0"}[15m]) > 1 or (absent_over_time(command_status{command="check_reboot",juju_unit="ubuntu/0"}[10m]) == 1) or (absent_over_time(up{juju_unit="ubuntu/0"}[10m]) == 1)
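The three expressions above share one shape and differ only in the command value. As a minimal sketch, the Python below rebuilds them from a template, which makes it easy to see exactly which juju_unit value a rule will look for. The helper name build_alert_expr is hypothetical and is not part of cos-proxy; the charm generates its rules with its own code, which is not reproduced here.

```python
def build_alert_expr(command: str, juju_unit: str) -> str:
    """Rebuild the alert expression shown above for one NRPE check.

    Hypothetical helper for illustration only; it mirrors the text of the
    rules quoted in this thread, not the charm's actual implementation.
    """
    selector = f'command_status{{command="{command}",juju_unit="{juju_unit}"}}'
    up_selector = f'up{{juju_unit="{juju_unit}"}}'
    return (
        f"avg_over_time({selector}[15m]) > 1"
        f" or (absent_over_time({selector}[10m]) == 1)"
        f" or (absent_over_time({up_selector}[10m]) == 1)"
    )


if __name__ == "__main__":
    # Print the three rules quoted above for the ubuntu/0 unit.
    for check in ("check_conntrack", "check_systemd_scopes", "check_reboot"):
        print(build_alert_expr(check, "ubuntu/0"))
```

Every clause in the expression carries the same juju_unit matcher, so if that value differs from what the scraped series carry, both absent_over_time clauses evaluate to 1 and the rule fires.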
Perhaps the situation described by @sed-i is a leftover of a previous deployment... 🤔
What do you think, @simskij?
Looks like the difference might be in deploying cos-proxy in an older version (rev 58) first, then upgrading. We'll try to reproduce this scenario.
I managed to reproduce the issue, and I don't think upgrading is part of the problem.
Reproduction scenario:
juju config nrpe nagios_host_context=testing-further
Fixed in #153.
For someone who hits this bug (I just did as part of a cloud handover), is there something we need to do other than "juju refresh" to pull the new charm? i.e. do we need to break and recreate relations, or should the refresh be sufficient?
Bug Description
Following up on #137.
In rev 101 we see metrics with labels such as:
However, the alerts we have in place for "absent" detection have different labels:
Note the difference:
juju_unit="NAGIOS-HOST-CONTEXT-aodh/0"
juju_unit="aodh/0" -- missing the nagios host context
As a result, the absent alert triggers constantly.
So it is correct that we do not want the nagios context as part of the label, but we need the metric labels to match this as well. I.e., in #137 it seems we removed it from the scrape job but overlooked the alert labels.
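To illustrate why a label mismatch makes an absent-style rule fire constantly: a selector only matches series whose label values are exactly equal, so a rule and a metric that disagree on juju_unit never meet. The Python below is a simplified model of that equality matching, not Prometheus code. The helper names and sample data are made up for illustration, and which side carries the prefix here is only an example.

```python
from typing import Dict, List

Labels = Dict[str, str]


def matches(series: Labels, selector: Labels) -> bool:
    """True if every selector label equals the series label exactly."""
    return all(series.get(name) == value for name, value in selector.items())


def absent(all_series: List[Labels], selector: Labels) -> bool:
    """Simplified absent(): True when no series matches the selector."""
    return not any(matches(s, selector) for s in all_series)


# Illustrative data only: one side has the host-context prefix, the other does not.
scraped = [{"__name__": "up", "juju_unit": "aodh/0"}]
rule_with_prefix = {"__name__": "up", "juju_unit": "NAGIOS-HOST-CONTEXT-aodh/0"}
rule_without_prefix = {"__name__": "up", "juju_unit": "aodh/0"}

if __name__ == "__main__":
    print(absent(scraped, rule_with_prefix))     # selector never matches
    print(absent(scraped, rule_without_prefix))  # selector matches the series
```

In this model the mismatched selector is always "absent" even though the unit is up and being scraped, which is the constant firing described above. Once both sides use the same juju_unit value, the selector matches and the rule stays quiet.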
To Reproduce
Deploy cos-proxy rev 101.
Environment
NTA
Relevant log output
Additional context
No response