Cluster Prometheus Rule CNPGClusterOffline is triggered when there are still instances up

Wain13 commented 1 month ago

It is possible I've completely lost my mind here, but I think:

In https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/prometheus_rules/cluster-offline.yaml, the rule expression is:

({{ .Values.cluster.instances }} - count(cnpg_collector_up{namespace=~"{{ .namespace }}",pod=~"{{ .podSelector }}"}) OR vector(0)) > 0

This expression triggers when any instance in the cluster is missing, but the alert is a critical one that is supposed to trigger only when all instances are missing.

In a 3-instance cluster, if one of the pods gets rescheduled and is offline, the above expression evaluates to 3 - 2 > 0 and the alert fires, even though 2 instances are up and running at that moment and the cluster is available.

Presumably it should be something like: (count(cnpg_collector_up{namespace=~"{{ .namespace }}",pod=~"{{ .podSelector }}"}) OR vector(0)) == 0
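
To make the difference concrete, here is a rough Python model of the two expressions. It is only a sketch under stated assumptions: the function names are hypothetical, `up` stands for the number of cnpg_collector_up series matched by the selector, and an empty count() result is assumed to fall through to the vector(0) branch (in PromQL, `-` binds tighter than `OR`).

```python
def current_rule_fires(instances: int, up: int) -> bool:
    """Rough model of: (instances - count(cnpg_collector_up{...}) OR vector(0)) > 0."""
    # Assumption: count() over an empty selection returns no series, so the
    # subtraction yields nothing and `OR vector(0)` substitutes 0.
    value = instances - up if up > 0 else 0
    return value > 0


def proposed_rule_fires(up: int) -> bool:
    """Rough model of: (count(cnpg_collector_up{...}) OR vector(0)) == 0."""
    # The fallback 0 stands in for "no series reported at all".
    return up == 0


if __name__ == "__main__":
    # Walk a 3-instance cluster from fully up to fully down.
    for up in range(3, -1, -1):
        print(f"up={up} current={current_rule_fires(3, up)} "
              f"proposed={proposed_rule_fires(up)}")
```

Under this model, the current rule fires with 2 of 3 instances up, while the proposed rule fires only when no instance reports the metric.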

itay-grudev commented 1 month ago

I was wondering why I was getting that alert during node upgrades and dismissed it as expected behaviour. You're correct, though: there is indeed a problem with it.

itay-grudev commented 1 month ago

I noticed two additional issues.

The namespace selector uses regex matching (=~), which seems totally unnecessary; an exact match (=) is enough.
The missing-data fallback should be: OR on() vector(0)

So the whole query should look like this:

(count(cnpg_collector_up{namespace="{{ .namespace }}",pod=~"{{ .podSelector }}"}) OR on() vector(0)) == 0

itay-grudev commented 1 month ago

@Wain13 Do you mind providing a review in #291?

Cluster Prometheus Rule CNPGClusterOffline is triggered when there are still instances up #283