canonical / avalanche-k8s-operator

https://charmhub.io/avalanche-k8s-operator
Apache License 2.0

Surprises with the alert based on the `absent` function #11

Closed sed-i closed 2 years ago

sed-i commented 2 years ago

Bug Description

There are two "always firing" rules in avalanche - one based on `absent()`, and one based on a metric value:
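The two rules are roughly of the following shape. This is a hedged sketch, not the charm's actual rule file: the `absent`-based expression is reconstructed from the `generatorURL` in the alertmanager output below, while the value-based rule's name and expression are placeholders.

```yaml
groups:
  - name: avalanche-always-firing
    rules:
      # Fires because the (intentionally non-existent) series is absent.
      # absent() synthesizes a single series, so it carries no labels
      # from any scraped time series.
      - alert: AlwaysFiringDueToAbsentMetric
        expr: absent(some_metric_name_that_shouldnt_exist{job="non_existing_job"})
        labels:
          severity: High

      # Hypothetical value-based rule: it evaluates real scraped series,
      # so each unit's time series (with its instance label) fires separately.
      - alert: AlwaysFiringDueToMetricValue
        expr: some_scraped_metric > 0
        labels:
          severity: High
```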

There are a few surprising things about the alert coming from the `absent`-based rule:

  1. It has no `instance` label, unlike the alert coming from the other rule, which does.
  2. It has no `juju_unit` label, unlike the alert coming from the other rule, which does.
  3. It is emitted once per app, instead of once per unit like the other rule.

To Reproduce

  1. git clone https://github.com/canonical/cos-lite-bundle
  2. cd cos-lite-bundle
  3. tox -e integration -- --keep-models

Environment

Model             Controller  Cloud/Region        Version  SLA          Timestamp
test-bundle-kjp4  newstuff    microk8s/localhost  2.9.25   unsupported  00:12:21Z

App           Version  Status  Scale  Charm             Store     Channel  Rev  OS          Address         Message
alertmanager           active      1  alertmanager-k8s  charmhub  edge      10  kubernetes  10.152.183.144  
avalanche              active      2  avalanche-k8s     charmhub  edge      15  kubernetes  10.152.183.140  
grafana                active      1  grafana-k8s       charmhub  edge      29  kubernetes  10.152.183.213  
loki                   active      1  loki-k8s          charmhub  edge      15  kubernetes  10.152.183.231  
prometheus             active      1  prometheus-k8s    charmhub  edge      20  kubernetes  10.152.183.20   

Unit             Workload  Agent  Address       Ports  Message
alertmanager/0*  active    idle   10.1.179.223         
avalanche/0*     active    idle   10.1.179.224         
avalanche/1      active    idle   10.1.179.226         
grafana/0*       active    idle   10.1.179.228         
loki/0*          active    idle   10.1.179.230         
prometheus/0*    active    idle   10.1.179.227         

Offer                         Application   Charm             Rev  Connected  Endpoint           Interface          Role
alertmanager-karma-dashboard  alertmanager  alertmanager-k8s  10   0/0        karma-dashboard    karma_dashboard    provider
grafana-dashboards            grafana       grafana-k8s       29   0/0        grafana-dashboard  grafana_dashboard  requirer
loki-logging                  loki          loki-k8s          15   0/0        logging            loki_push_api      provider
prometheus-scrape             prometheus    prometheus-k8s    20   0/0        metrics-endpoint   prometheus_scrape  requirer

Relation provider            Requirer                     Interface              Type     Message
alertmanager:alerting        prometheus:alertmanager      alertmanager_dispatch  regular  
alertmanager:replicas        alertmanager:replicas        alertmanager_replica   peer     
avalanche:metrics-endpoint   prometheus:metrics-endpoint  prometheus_scrape      regular  
avalanche:replicas           avalanche:replicas           avalanche_replica      peer     
grafana:grafana              grafana:grafana              grafana_peers          peer     
loki:grafana-source          grafana:grafana-source       grafana_datasource     regular  
prometheus:grafana-source    grafana:grafana-source       grafana_datasource     regular  
prometheus:prometheus-peers  prometheus:prometheus-peers  prometheus_peers       peer     

Relevant log output

See "Additional context".

Additional context

Alert rendered from `absent` doesn't have an instance label

Note how the annotations have empty strings where the `instance` value should have gone:

      "description": " of job non_existing_job is firing the dummy alarm.",
      "summary": "Instance  dummy alarm (always firing)"

Also note how `juju_unit` is missing from the alert labels.
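The empty gaps line up with an annotation template of roughly this shape (a hypothetical reconstruction from the rendered strings above, not the charm's actual rule file). Because the series synthesized by `absent()` has no `instance` label, `{{ $labels.instance }}` expands to the empty string:

```yaml
annotations:
  summary: "Instance {{ $labels.instance }} dummy alarm (always firing)"
  description: "{{ $labels.instance }} of job {{ $labels.job }} is firing the dummy alarm."
```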

Relevant prometheus output:

$ curl -s 10.1.179.227:9090/api/v1/alerts | jq
{
  "status": "success",
  "data": {
    "alerts": [
      {
        "labels": {
          "alertname": "AlwaysFiringDueToAbsentMetric",
          "job": "non_existing_job",
          "juju_application": "avalanche",
          "juju_charm": "avalanche-k8s",
          "juju_model": "test-bundle-kjp4",
          "juju_model_uuid": "6376405a-54dc-45b1-8eb8-b42f96d51f12",
          "severity": "High"
        },
        "annotations": {
          "description": " of job non_existing_job is firing the dummy alarm.",
          "summary": "Instance  dummy alarm (always firing)"
        },
        "state": "firing",
        "activeAt": "2022-03-10T23:29:31.289307407Z",
        "value": "1e+00"
      }
    ]
  }
}

Relevant alertmanager output:

$ curl -s 10.1.179.223:9093/api/v2/alerts | jq
[
  {
    "annotations": {
      "description": " of job non_existing_job is firing the dummy alarm.",
      "summary": "Instance  dummy alarm (always firing)"
    },
    "endsAt": "2022-03-10T23:51:31.289Z",
    "fingerprint": "bc3f0f827af3d64d",
    "receivers": [
      {
        "name": "dummy"
      }
    ],
    "startsAt": "2022-03-10T23:29:31.289Z",
    "status": {
      "inhibitedBy": [],
      "silencedBy": [],
      "state": "active"
    },
    "updatedAt": "2022-03-10T23:47:31.293Z",
    "generatorURL": "http://10.1.179.227:9090/graph?g0.expr=absent%28some_metric_name_that_shouldnt_exist%7Bjob%3D%22non_existing_job%22%7D%29&g0.tab=1",
    "labels": {
      "alertname": "AlwaysFiringDueToAbsentMetric",
      "job": "non_existing_job",
      "juju_application": "avalanche",
      "juju_charm": "avalanche-k8s",
      "juju_model": "test-bundle-kjp4",
      "juju_model_uuid": "6376405a-54dc-45b1-8eb8-b42f96d51f12",
      "severity": "High"
    }
  }
]
simskij commented 2 years ago

This is expected. A rule that triggers due to the absence of a metric has no time series to get its labels from.
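Concretely, `absent()` returns a single synthetic series whose labels come only from the equality matchers inside its selector, never from scraped series. That is why `job` survives in the alert above but `instance` does not, and why the alert fires once rather than once per unit. (The `juju_*` labels presumably survive for a different reason: they appear to be injected as static rule labels by the charm library, not derived from the series.) Example expressions, not from the charm:

```promql
# Labels are copied only from equality matchers in the selector:
absent(some_metric{job="non_existing_job"})
# yields the single series {job="non_existing_job"} with value 1

# instance would appear only if matched explicitly:
absent(some_metric{job="non_existing_job", instance="10.1.179.224:9001"})
# yields {job="non_existing_job", instance="10.1.179.224:9001"} with value 1

# Regex matchers contribute nothing:
absent(some_metric{job=~"non.*"})
# yields the empty label set {} with value 1
```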