dalibo / check_patroni

A nagios plugin for patroni.
PostgreSQL License

patroni update changed state from "running" to "streaming" for replica nodes #28

Closed log1-c closed 1 year ago

log1-c commented 1 year ago

With the latest update to v3.0.4, Patroni changed its state string for replica nodes: https://patroni.readthedocs.io/en/latest/releases.html#version-3-0-4

# patronictl -c /opt/patroni/etc/postgresql.yml list
+ Cluster: postgres ------------+---------+-----------+----+-----------+
| Member        | Host          | Role    | State     | TL | Lag in MB |
+---------------+---------------+---------+-----------+----+-----------+
| xyz-abcd-db01 | 11.111.1.111  | Leader  | running   | 66 |           |
| xyz-abcd-db02 | 11.111.1.111  | Replica | streaming | 66 |         0 |
| xyz-abcd-db03 | 11.111.1.111  | Replica | streaming | 66 |         0 |
+---------------+---------------+---------+-----------+----+-----------+

Sadly, I couldn't quite figure out how exactly the counting is done, or I would have included a PR.

Cheers and a big thanks for the check!

log1-c commented 1 year ago

Possible fix:

Change line 37 of check_patroni/cluster.py to

yield nagiosplugin.Metric("state_running", status_counters["running"] + status_counters.get("streaming", 0))

That way it is also compatible with previous versions.
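
A minimal sketch of why the sum stays backward compatible, assuming `status_counters` is a `collections.Counter` keyed by state string (an assumption about the plugin, not taken from its source). The `count_running` helper and the member dictionaries are made up for illustration:

```python
from collections import Counter


def count_running(members):
    """Count members whose state is "running" or "streaming" (hypothetical helper)."""
    status_counters = Counter(m["state"] for m in members)
    # On Patroni releases before 3.0.4 no member reports "streaming",
    # so .get("streaming", 0) contributes 0 and the result is unchanged.
    return status_counters["running"] + status_counters.get("streaming", 0)


# Made-up sample data: replicas report "running" before 3.0.4 ...
old_cluster = [
    {"name": "db01", "role": "leader", "state": "running"},
    {"name": "db02", "role": "replica", "state": "running"},
    {"name": "db03", "role": "replica", "state": "running"},
]
# ... and "streaming" from 3.0.4 on.
new_cluster = [
    {"name": "db01", "role": "leader", "state": "running"},
    {"name": "db02", "role": "replica", "state": "streaming"},
    {"name": "db03", "role": "replica", "state": "streaming"},
]

print(count_running(old_cluster))  # 3
print(count_running(new_cluster))  # 3
```

Both cluster shapes yield the same count, so existing thresholds would keep working.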

blogh commented 1 year ago

Hi, sorry for the long delay, I was on holiday. I will look into this ASAP.

blogh commented 1 year ago

Hello,

I made some additional changes beyond my initial plan. While the check you originally suggested was satisfactory, the performance data it provided turned out to be misleading:

CLUSTERNODECOUNT OK - members is 2 | members=2 role_leader=1 role_replica=1 state_running=2 state_streaming=1

To address this, I introduced a new `healthy_members` performance data value that counts nodes in either the running or the streaming state. This way the check keeps monitoring the same set of nodes as before, while `state_running` and `state_streaming` each report only their own state.

The revised output of the check is as follows:

CLUSTERNODECOUNT OK - members is 2 | healthy_members=2 members=2 role_leader=1 role_replica=1 state_running=1 state_streaming=1

The key modifications are:

* The existing `--running-[warning|critical]` option has been renamed
  to `--healthy-[warning|critical]`.
* A new `healthy_members` perfdata value has been added; the options
  above are evaluated against it.
* Updates to documentation, help messages, and tests.
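
As a rough illustration of the idea (not the code that will be committed), the sketch below keeps the per-state counters untouched and derives `healthy_members` separately. The function names, the `HEALTHY_STATES` tuple, and the member dictionaries are hypothetical:

```python
from collections import Counter

# Assumption: "healthy" means running or streaming, as described above.
HEALTHY_STATES = ("running", "streaming")


def cluster_perfdata(members):
    """Build perfdata for a list of member dicts (hypothetical helper)."""
    role_counters = Counter(m["role"] for m in members)
    state_counters = Counter(m["state"] for m in members)
    perfdata = {
        "members": len(members),
        # Derived value; the state_* counters below are left as-is.
        "healthy_members": sum(state_counters[s] for s in HEALTHY_STATES),
    }
    for role, count in role_counters.items():
        perfdata[f"role_{role}"] = count
    for state, count in state_counters.items():
        perfdata[f"state_{state}"] = count
    return perfdata


def format_perfdata(perfdata):
    """Render perfdata as space-separated key=value pairs, sorted by key."""
    return " ".join(f"{key}={value}" for key, value in sorted(perfdata.items()))


# Made-up two-node cluster matching the example output above.
members = [
    {"name": "db01", "role": "leader", "state": "running"},
    {"name": "db02", "role": "replica", "state": "streaming"},
]

print(format_perfdata(cluster_perfdata(members)))
# healthy_members=2 members=2 role_leader=1 role_replica=1 state_running=1 state_streaming=1
```

Keeping the sum out of `state_running` means the per-state values always add up to `members`.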

I plan to commit these changes soon and will be addressing a few other issues throughout the week. Hopefully, I'll be able to finalize a new release by the end of the week.

Thank you.