Closed by Wain13 6 months ago
I had wondered why I was getting that alert during node upgrades and dismissed it as expected, but you're correct: there is indeed a problem with it.
I noticed two additional issues:
1. The namespace label is matched with =~ which seems totally unnecessary.
2. The fallback should be written as OR on() vector(0)
So the whole query should look like this:
(count(cnpg_collector_up{namespace="{{ .namespace }}",pod=~"{{ .podSelector }}"}) OR on() vector(0)) == 0
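For anyone wondering why = is enough for the namespace while the pod selector keeps =~, here is a rough Python sketch. It assumes PromQL regex matchers are fully anchored and models them with re.fullmatch. The namespace and pod pattern are hypothetical values, not taken from the chart, and Python's regex engine is not the RE2 engine Prometheus uses, so treat this as an illustration only.

```python
import re


def regex_matcher(pattern: str, value: str) -> bool:
    """Model a PromQL `=~` matcher, assuming it is anchored at both ends."""
    return re.fullmatch(pattern, value) is not None


def equality_matcher(expected: str, value: str) -> bool:
    """Model a PromQL `=` matcher: plain string equality."""
    return expected == value


# Hypothetical rendered values for {{ .namespace }} and {{ .podSelector }}.
NAMESPACE = "database"
POD_SELECTOR = "pg-[0-9]+"

if __name__ == "__main__":
    # A literal namespace behaves the same under `=~` and `=`.
    for ns in ("database", "database-staging", "my-database"):
        print(ns, regex_matcher(NAMESPACE, ns), equality_matcher(NAMESPACE, ns))

    # The pod selector is a real pattern, so it still needs `=~`.
    for pod in ("pg-1", "pg-10", "pg-1-pooler"):
        print(pod, regex_matcher(POD_SELECTOR, pod))
```

With a literal namespace the two matchers agree on every input, which is why the regex form adds nothing there.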
@Wain13 Do you mind providing a review in #291?
It is possible I've completely lost my mind here, but I think there is a problem in https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/prometheus_rules/cluster-offline.yaml. The rule expression:
({{ .Values.cluster.instances }} - count(cnpg_collector_up{namespace=~"{{ .namespace }}",pod=~"{{ .podSelector }}"}) OR vector(0)) > 0
will trigger when any instance in a cluster is missing, but this is a critical alert that is supposed to trigger only when all instances are missing.
In a 3-instance cluster, if one of the pods gets rescheduled and is offline, the above expression evaluates to 3 - 2 > 0 and triggers the alert, even though 2 instances are up and running at that moment and the cluster is available.
Presumably it should be something like:
(count(cnpg_collector_up{namespace=~"{{ .namespace }}",pod=~"{{ .podSelector }}"}) OR vector(0)) == 0
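To sanity-check the arithmetic, here is a small Python model of the two expressions. It is only a sketch under a few assumptions: an offline pod exports no cnpg_collector_up series at all, count() over no series returns an empty result rather than 0, and - binds tighter than OR in the original expression. The function names are made up for illustration.

```python
from typing import Optional


def count_series(up_pods: int) -> Optional[int]:
    """Model PromQL count(): an empty result (None) when no series match."""
    return up_pods if up_pods > 0 else None


def or_vector_zero(value: Optional[int]) -> int:
    """Model `<expr> OR vector(0)`: fall back to 0 when <expr> is empty."""
    return 0 if value is None else value


def original_fires(instances: int, up_pods: int) -> bool:
    # (instances - count(...) OR vector(0)) > 0
    # Assumption: `-` binds tighter than `OR`, and a scalar minus an
    # empty vector is itself empty.
    count = count_series(up_pods)
    missing = None if count is None else instances - count
    return or_vector_zero(missing) > 0


def proposed_fires(up_pods: int) -> bool:
    # (count(...) OR vector(0)) == 0
    return or_vector_zero(count_series(up_pods)) == 0


if __name__ == "__main__":
    # Walk a 3-instance cluster from fully up to fully down.
    for up in (3, 2, 1, 0):
        print(f"up={up} original={original_fires(3, up)} proposed={proposed_fires(up)}")
```

In this model the original rule fires with 2 of 3 instances up, while the proposed one fires only when nothing is reporting.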