Closed mbanck closed 9 months ago
Good catch :
-bash-4.2$ patroni --version
patroni 3.1.1
-bash-4.2$ patronictl --config-file /etc/patroni/demo.yaml list
+ Cluster: patroni-demo (7283074439919541064) ----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+--------+-------------+---------+-----------+----+-----------+
| p1 | 10.20.30.51 | Replica | running | 4 | 173 |
| p2 | 10.20.30.52 | Leader | running | 8 | |
| p3 | 10.20.30.53 | Replica | streaming | 8 | 0 |
+--------+-------------+---------+-----------+----+-----------+
(.venv) [vagrant@p2 check_patroni]$ check_patroni -vv -e http://10.20.30.53:8008 cluster_has_replica
CLUSTERHASREPLICA OK - healthy_replica is 2
| healthy_replica=2 p1_lag=181687224 p1_sync=0 p3_lag=0 p3_sync=0 sync_replica=0 unhealthy_replica=0
The node_is_replica
is also kinda useless as is ..
Tests with the same cluster:
(.venv) [vagrant@p2 check_patroni]$ check_patroni -vv -e http://10.20.30.51:8008 node_is_replica
NODEISREPLICA OK - This node is a running replica with no noloadbalance tag.
| is_replica=1;;@0
(.venv) [vagrant@p2 check_patroni]$ check_patroni -vv -e http://10.20.30.53:8008 node_is_replica
NODEISREPLICA OK - This node is a running replica with no noloadbalance tag.
| is_replica=1;;@0
The check/description would be more usefull if we checked for healthy replicas.
(.venv) [vagrant@p2 check_patroni]$ check_patroni node_is_replica --help
Usage: check_patroni node_is_replica [OPTIONS]
Check if the node is a running replica with no noloadbalance tag.
It is possible to check if the node is synchronous or asynchronous. If
nothing is specified any kind of replica is accepted. When checking for a
synchronous replica, it's not possible to specify a lag.
Check:
* `OK`: if the node is a running replica with noloadbalance tag and the lag is under the maximum threshold.
* `CRITICAL`: otherwise
Perfdata: `is_replica` is 1 if the node is a running replica with
noloadbalance tag and the lag is under the maximum threshold, 0 otherwise.
What's your opinion on this ?
I don't know about the node_is_replica
check, as the roles can be dynamic, I don't have a good use-case for this; maybe there should be a useful default for max-lag
though? Patroni's kinda-default is 1MB I belive (maximum_lag_on_failover: 104857
).
I don't have a great opinion on what the UX should be, except I think the less complicated the better, and it seems useless to check for a running but non-working (or lagging) replica.
So I think a replica should be considered healthy when it is streaming or - for older Patroni versions - when it is on the same timline as the local leader and has no lag.
Well, put otherwise: I had a look at #30 and I noticed that it does not seem to touch node_is_replica
.
Maybe a second check node_is_follower
or so could be added that does the right thing, or node_is_replica
's behavior changed to do so (but probably warranting a major version bump then), as check-patroni is a relatively young project.
Fixed thanks for reporting this.
Re opened, there is still the cluster_node_count
issue to fix.
In #30 the
streaming
state was added as an alternative torunning
. However, I don't agree with this - if the Patroni version is 3.0.4 or up, a state of merelyrunning
indicates an unhealthy (streaming replication not working) node.So it would be good to have an option for the cluster_node_count command to only count streaming (replicas and standby leaders) as healthy.