Mismatch between labels in "absent" expr and actual metric labels

sed-i commented 3 months ago

Bug Description

Following up on #137.

In rev 101 we see metrics with labels such as:

command_status{
  command="check_aodh-evaluator",
  dns_name="REDACTED", host="10.x.x.x", instance="10.x.x.x:xxxx",
  job="juju_openstack_c2eab20_NAGIOS_HOST_CONTEXT_aodh_0_check_aodh-evaluator_prometheus_scrape",
  juju_application="aodh", juju_model="openstack", juju_model_uuid="REDACTED", juju_unit="aodh/0"
}

However, the alerts we have in place for "absent" detection have different labels:

avg_over_time(command_status{command="check_aodh-evaluator",juju_unit="NAGIOS-HOST-CONTEXT-aodh/0"}[15m]) > 1
  or
(absent_over_time(command_status{command="check_aodh-evaluator",juju_unit="NAGIOS-HOST-CONTEXT-aodh/0"}[10m]) == 1)
  or
(absent_over_time(up{juju_unit="NAGIOS-HOST-CONTEXT-aodh/0"}[10m]) == 1)

Note the difference:

juju_unit="NAGIOS-HOST-CONTEXT-aodh/0" -- alert rule: includes the nagios host context
juju_unit="aodh/0" -- metric: missing the nagios host context

As a result, the "absent" alert fires constantly.

It is correct that we do not want the nagios context as part of the label, but the alert labels need to match the metric labels. It seems that in #137 we removed the context from the scrape job but overlooked the alert labels.
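To illustrate why the mismatch makes the "absent" branch fire, here is a minimal Python sketch of equality-matcher semantics. It is a simplified model, not Prometheus code: the helper names `selector_matches` and `absent` are made up for illustration, time ranges are ignored, and the label sets are trimmed versions of the ones quoted above.

```python
def selector_matches(selector: dict, series_labels: dict) -> bool:
    """Model of equality matchers: every label in the selector must equal
    the label on the series (a missing label is treated as empty)."""
    return all(series_labels.get(name, "") == value
               for name, value in selector.items())


def absent(selector: dict, all_series: list) -> bool:
    """True when no series matches the selector, i.e. the absent branch fires."""
    return not any(selector_matches(selector, s) for s in all_series)


# Labels as scraped in rev 101 (trimmed to the relevant ones).
scraped = [{
    "__name__": "command_status",
    "command": "check_aodh-evaluator",
    "juju_application": "aodh",
    "juju_unit": "aodh/0",
}]

# Selector used by the alert rule quoted in this report.
alert_selector = {
    "__name__": "command_status",
    "command": "check_aodh-evaluator",
    "juju_unit": "NAGIOS-HOST-CONTEXT-aodh/0",
}

# Same selector with the juju_unit value the metric actually carries.
fixed_selector = dict(alert_selector, juju_unit="aodh/0")

print(absent(alert_selector, scraped))  # True: nothing matches, alert fires
print(absent(fixed_selector, scraped))  # False: the series is found
```

Because a single differing label value is enough for the selector to match nothing, the alert fires even though the check itself is being scraped normally.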

To Reproduce

Deploy cos proxy rev 101

graph LR
nrpe ---|monitors| cos-proxy ---|"(cmr)"| scrape-config --- prom

Environment

NTA

Relevant log output

NTA

Additional context

No response

Abuelodelanada commented 3 months ago

Hi @sed-i

Yes, the nagios_host_context was added in charm-nrpe in order to let us remove it from the juju_unit label value. The alert rules need to be fixed.

Abuelodelanada commented 3 months ago

I've deployed:

Model  Controller  Cloud/Region         Version  SLA          Timestamp
ubu    lxd         localhost/localhost  3.5.2.1  unsupported  13:55:51-03:00

SAAS        Status  Store     URL
prometheus  active  microk8s  admin/cos.prometheus

App        Version  Status  Scale  Charm      Channel        Rev  Exposed  Message
cos-proxy  n/a      active      1  cos-proxy  latest/edge    101  no       
nrpe                active      1  nrpe       latest/edge    125  no       Ready
ubuntu     22.04    active      1  ubuntu     latest/stable   24  no       

Unit          Workload  Agent  Machine  Public address  Ports          Message
cos-proxy/0*  active    idle   1        10.222.104.142                 
ubuntu/0*     active    idle   0        10.222.104.154                 
  nrpe/0*     active    idle            10.222.104.154  5666/tcp icmp  Ready

Machine  State    Address         Inst id        Base          AZ  Message
0        started  10.222.104.154  juju-db211a-0  ubuntu@22.04      Running
1        started  10.222.104.142  juju-db211a-1  ubuntu@22.04      Running

Integration provider                    Requirer                     Interface          Type         Message
cos-proxy:downstream-prometheus-scrape  prometheus:metrics-endpoint  prometheus_scrape  regular      
nrpe:monitors                           cos-proxy:monitors           monitors           regular      
ubuntu:juju-info                        nrpe:general-info            juju-info          subordinate

... cross-related to this:

Model  Controller  Cloud/Region        Version  SLA          Timestamp
cos    microk8s    microk8s/localhost  3.5.2    unsupported  13:56:46-03:00

App   Version  Status  Scale  Charm           Channel      Rev  Address         Exposed  Message
prom  2.52.0   active      1  prometheus-k8s  latest/edge  210  10.152.183.138  no       

Unit     Workload  Agent  Address     Ports  Message
prom/0*  active    idle   10.1.9.215         

Offer       Application  Charm           Rev  Connected  Endpoint          Interface          Role
prometheus  prom         prometheus-k8s  210  1/1        metrics-endpoint  prometheus_scrape  requirer

Integration provider   Requirer               Interface         Type  Message
prom:prometheus-peers  prom:prometheus-peers  prometheus_peers  peer

And all alert rules are inactive:

The juju_unit labels in the alert rules seem OK:

avg_over_time(command_status{command="check_conntrack",juju_unit="ubuntu/0"}[15m]) > 1 or (absent_over_time(command_status{command="check_conntrack",juju_unit="ubuntu/0"}[10m]) == 1) or (absent_over_time(up{juju_unit="ubuntu/0"}[10m]) == 1)

avg_over_time(command_status{command="check_systemd_scopes",juju_unit="ubuntu/0"}[15m]) > 1 or (absent_over_time(command_status{command="check_systemd_scopes",juju_unit="ubuntu/0"}[10m]) == 1) or (absent_over_time(up{juju_unit="ubuntu/0"}[10m]) == 1)

avg_over_time(command_status{command="check_reboot",juju_unit="ubuntu/0"}[15m]) > 1 or (absent_over_time(command_status{command="check_reboot",juju_unit="ubuntu/0"}[10m]) == 1) or (absent_over_time(up{juju_unit="ubuntu/0"}[10m]) == 1)

Perhaps the situation described by @sed-i is a leftover of a previous deployment... 🤔

What do you think @simskij ?

mmkay commented 3 months ago

Looks like the difference might be in deploying cos-proxy in an older version (rev 58) first, then upgrading. We'll try to reproduce this scenario.

mmkay commented 3 months ago

I managed to reproduce the issue, and I don't think upgrading is part of the problem.

Reproduction scenario:

1. Run the scenario above as described by @Abuelodelanada.
2. Update the nrpe config: juju config nrpe nagios_host_context=testing-further
3. Notice that alerts have the nagios_host_context added to juju_unit, while metrics don't.
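One way to check for this condition without eyeballing the rules is sketched below in Python. It is only an illustration: `unmatched_units` is a hypothetical helper, it looks only at plain `juju_unit="..."` equality matchers, and the set of scraped unit names is typed in by hand rather than fetched from Prometheus. The two expressions are copied from earlier in this thread.

```python
import re

# Matches plain equality matchers only, e.g. juju_unit="ubuntu/0".
JUJU_UNIT_MATCHER = re.compile(r'juju_unit="([^"]*)"')


def unmatched_units(alert_expr: str, scraped_units: set) -> set:
    """Return juju_unit values used in the alert expression that no
    scraped metric carries."""
    return set(JUJU_UNIT_MATCHER.findall(alert_expr)) - set(scraped_units)


# Expression from the original report: the context is baked into juju_unit.
bad_expr = (
    'avg_over_time(command_status{command="check_aodh-evaluator",'
    'juju_unit="NAGIOS-HOST-CONTEXT-aodh/0"}[15m]) > 1 '
    'or (absent_over_time(up{juju_unit="NAGIOS-HOST-CONTEXT-aodh/0"}[10m]) == 1)'
)

# Expression from the clean deployment: juju_unit matches the metric.
good_expr = (
    'avg_over_time(command_status{command="check_conntrack",'
    'juju_unit="ubuntu/0"}[15m]) > 1 '
    'or (absent_over_time(up{juju_unit="ubuntu/0"}[10m]) == 1)'
)

print(unmatched_units(bad_expr, {"aodh/0"}))     # {'NAGIOS-HOST-CONTEXT-aodh/0'}
print(unmatched_units(good_expr, {"ubuntu/0"}))  # set()
```

A non-empty result means the rule refers to a unit name that no metric uses, which is the situation reproduced in the steps above.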

Screenshot from 2024-08-06 12-39-23

sed-i commented 3 months ago

Fixed in #153.

Vultaire commented 1 month ago

For someone who hits this bug (I just did as part of a cloud handover), is there something we need to do other than "juju refresh" to pull the new charm? i.e. do we need to break and recreate relations, or should the refresh be sufficient?
