dalibo / check_patroni

A nagios plugin for patroni.
PostgreSQL License

cluster_has_replica: fix the way a healthy replica is detected #54

Closed blogh closed 11 months ago

blogh commented 1 year ago

For patroni >= version 3.0.4:

For prior versions:

blogh commented 1 year ago

cf #50

blogh commented 1 year ago

I still need to fix the tests and try on older supported python versions.

mbanck commented 1 year ago

If I read the changes correctly, this also adds the timeline to the perfdata? That might warrant a release notes item as well then.

blogh commented 1 year ago

You are right, I changed it. I'll probably continue next week. I am booked for a client this week.

blogh commented 1 year ago

Hi @mbanck,

Do you want to review it?

blogh commented 12 months ago

I think this is still wrong.

From PostgreSQL's perspective, a healthy standby could be streaming or in archive recovery (we don't use slots and use log shipping to catch up). And if we look at is_healthiest_node or is_failover_possible, Patroni doesn't care about the state of the node either (maybe I missed it?).

It checks things like:

So I think we should do something like

if version < 3.0.4:
    if state == "running" and TL == leader TL:
        test for lag if needed
        the node is healthy

if version >= 3.0.4:
    if state in ["streaming", "in archive recovery"] and TL == leader TL:
        test for lag if needed
        the node is healthy
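
For illustration only, the pseudocode above could be sketched in Python roughly as follows. This is not the code from the PR: the function name `is_healthy_replica`, the `max_lag` parameter, and the member keys `state`, `timeline` and `lag` are assumptions modeled on what Patroni's cluster listing usually reports, and the version is passed in as a tuple of integers.

```python
from typing import Any, Dict, Optional, Tuple


def is_healthy_replica(
    member: Dict[str, Any],
    leader_timeline: int,
    patroni_version: Tuple[int, int, int],
    max_lag: Optional[int] = None,
) -> bool:
    """Hypothetical helper: decide whether a replica counts as healthy.

    `member` is assumed to be a dict with "state", "timeline" and "lag" keys.
    """
    # Before 3.0.4 replicas are expected to report "running"; from 3.0.4 on,
    # the state distinguishes "streaming" from "in archive recovery".
    if patroni_version >= (3, 0, 4):
        healthy_states = {"streaming", "in archive recovery"}
    else:
        healthy_states = {"running"}

    if member.get("state") not in healthy_states:
        return False

    # The replica must be on the same timeline as the leader.
    if member.get("timeline") != leader_timeline:
        return False

    # Test for lag only if a threshold was requested.
    if max_lag is not None:
        lag = member.get("lag")
        # A missing or non-numeric lag (e.g. "unknown") is treated as unhealthy.
        if not isinstance(lag, int) or lag > max_lag:
            return False

    return True
```

With this sketch, a replica in archive recovery on the leader's timeline is healthy unless a lag threshold is given and exceeded, which matches the idea that catching up is judged through lag rather than through state.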

I don't know what to do about nodes with a nofailover tag. Maybe exclude them when a new --exclude-nofailover-tag option is set?

mbanck commented 11 months ago

I guess "in archive recovery" means the standby is currently catching up; whether that is healthy or not could then be checked via lag. So I think the above is fine.

I am also not sure what to do about nofailover tags, but in my opinion, this is orthogonal to whether a node is healthy or not.
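
If the nofailover question is kept orthogonal to health, one option is to filter members in a separate step before any health check runs. The sketch below is only an assumption about how that could look: `exclude_nofailover` is a hypothetical helper, the `exclude` flag stands in for the proposed `--exclude-nofailover-tag` option (which does not exist yet), and members are assumed to carry an optional `tags` dict.

```python
from typing import Any, Dict, Iterable, List


def exclude_nofailover(
    members: Iterable[Dict[str, Any]], exclude: bool = False
) -> List[Dict[str, Any]]:
    """Hypothetical pre-filter: drop members tagged nofailover when asked to."""
    if not exclude:
        # Default behaviour: keep every member, tagged or not.
        return list(members)
    return [
        m for m in members
        # A member without tags, or without the tag, is kept.
        if not (m.get("tags") or {}).get("nofailover", False)
    ]
```

Keeping the filter separate means the health logic itself never needs to know about tags.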

blogh commented 11 months ago

@dlax could you have another look, please?