dalibo / check_patroni

A nagios plugin for patroni.
PostgreSQL License
7 stars 3 forks source link

(optionally) only consider streaming nodes to be healthy #50

Closed mbanck closed 9 months ago

mbanck commented 1 year ago

In #30 the streaming state was added as an alternative to running. However, I don't agree with this - if the Patroni version is 3.0.4 or up, a state of merely running indicates an unhealthy (streaming replication not working) node.

So it would be good to have an option for the cluster_node_count command to only count streaming (replicas and standby leaders) as healthy.

blogh commented 1 year ago

Good catch :

-bash-4.2$ patroni --version
patroni 3.1.1

-bash-4.2$ patronictl --config-file /etc/patroni/demo.yaml list
+ Cluster: patroni-demo (7283074439919541064) ----+-----------+
| Member | Host        | Role    | State     | TL | Lag in MB |
+--------+-------------+---------+-----------+----+-----------+
| p1     | 10.20.30.51 | Replica | running   |  4 |       173 |
| p2     | 10.20.30.52 | Leader  | running   |  8 |           |
| p3     | 10.20.30.53 | Replica | streaming |  8 |         0 |
+--------+-------------+---------+-----------+----+-----------+
(.venv) [vagrant@p2 check_patroni]$ check_patroni -vv -e http://10.20.30.53:8008 cluster_has_replica   
CLUSTERHASREPLICA OK - healthy_replica is 2
| healthy_replica=2 p1_lag=181687224 p1_sync=0 p3_lag=0 p3_sync=0 sync_replica=0 unhealthy_replica=0
blogh commented 1 year ago

The node_is_replica is also kinda useless as is ..

Tests with the same cluster:

(.venv) [vagrant@p2 check_patroni]$ check_patroni -vv -e http://10.20.30.51:8008 node_is_replica       
NODEISREPLICA OK - This node is a running replica with no noloadbalance tag.
| is_replica=1;;@0

(.venv) [vagrant@p2 check_patroni]$ check_patroni -vv -e http://10.20.30.53:8008 node_is_replica
NODEISREPLICA OK - This node is a running replica with no noloadbalance tag.
| is_replica=1;;@0

The check/description would be more usefull if we checked for healthy replicas.

(.venv) [vagrant@p2 check_patroni]$ check_patroni node_is_replica --help
Usage: check_patroni node_is_replica [OPTIONS]

  Check if the node is a running replica with no noloadbalance tag.

  It is possible to check if the node is synchronous or asynchronous. If
  nothing is specified any kind of replica is accepted. When checking for a
  synchronous replica, it's not possible to specify a lag.

  Check:
  * `OK`: if the node is a running replica with noloadbalance tag and the lag is under the maximum threshold.
  * `CRITICAL`:  otherwise

  Perfdata: `is_replica` is 1 if the node is a running replica with
  noloadbalance tag and the lag is under the maximum threshold, 0 otherwise.

What's your opinion on this ?

mbanck commented 1 year ago

I don't know about the node_is_replica check, as the roles can be dynamic, I don't have a good use-case for this; maybe there should be a useful default for max-lag though? Patroni's kinda-default is 1MB I belive (maximum_lag_on_failover: 104857).

I don't have a great opinion on what the UX should be, except I think the less complicated the better, and it seems useless to check for a running but non-working (or lagging) replica.

So I think a replica should be considered healthy when it is streaming or - for older Patroni versions - when it is on the same timline as the local leader and has no lag.

mbanck commented 1 year ago

Well, put otherwise: I had a look at #30 and I noticed that it does not seem to touch node_is_replica.

Maybe a second check node_is_follower or so could be added that does the right thing, or node_is_replica's behavior changed to do so (but probably warranting a major version bump then), as check-patroni is a relatively young project.

blogh commented 11 months ago

Fixed thanks for reporting this.

blogh commented 11 months ago

Re opened, there is still the cluster_node_count issue to fix.

blogh commented 9 months ago

Done here: https://github.com/dalibo/check_patroni/pull/66