cluster_has_replica: fix the way a healthy replica is detected

blogh commented 1 year ago

For Patroni version >= 3.0.4:

the role is replica or sync_standby
the state is streaming
the lag is lower than or equal to max_lag

For prior versions:

the role is replica or sync_standby
the state is running and the timeline is the same as the leader's
the lag is lower than or equal to max_lag
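
The version gate in the rules above can be sketched as follows. This is only an illustration, not the code from this PR: `parse_version` and `expected_replica_states` are hypothetical helper names, and the sketch assumes the Patroni version is a plain dotted numeric string. It compares versions as integer tuples because a plain string comparison would sort "3.0.10" before "3.0.4".

```python
from typing import Set, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as "3.0.4" into (3, 0, 4).

    Assumption: every component is numeric (no suffix such as "rc1").
    """
    return tuple(int(part) for part in version.split("."))


def expected_replica_states(patroni_version: str) -> Set[str]:
    """Return the member states treated as healthy for this Patroni version.

    Hypothetical helper mirroring the rules in the description:
    "streaming" from 3.0.4 onwards, "running" for prior versions
    (where the timeline must additionally match the leader's).
    """
    if parse_version(patroni_version) >= (3, 0, 4):
        return {"streaming"}
    return {"running"}
```

With this, `expected_replica_states("3.0.10")` selects the newer rule even though the string "3.0.10" compares lower than "3.0.4".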

blogh commented 1 year ago

cf #50

blogh commented 1 year ago

I still need to fix the tests and try on older supported Python versions.

mbanck commented 1 year ago

If I read the changes correctly, this also adds the timeline to the perfdata? That might warrant a release notes item as well then.

blogh commented 1 year ago

You are right, I changed it. I'll probably continue next week. I am booked for a client this week.

blogh commented 1 year ago

Hi @mbanck,

Do you want to review it?

blogh commented 12 months ago

I think this is still wrong.

From PostgreSQL's perspective, a healthy standby could be streaming or in archive recovery (we don't use slots and use log shipping to catch up). And if we look at is_healthiest_node or is_failover_possible, Patroni doesn't care about the state of the node either (maybe I missed it?).

It checks things like:

the timeline matches the leader's timeline (we do it only for patroni < 3.0.4)
the lag is lower than or equal to maximum_lag_on_failover (we do it if --max-lag is used)
the nofailover tag is present (we don't check for that)
the watchdog is available (we don't check for that, and I think we can't do it from the API)
the cluster is not paused (we don't check for that here but there is a dedicated service for that)

So I think we should do something like

if version < 3.0.4:
    if state == "running" and TL == leader TL:
        test for lag if needed
        the node is healthy

if version >= 3.0.4:
    if state in ["streaming", "in archive recovery"] and TL == leader TL:
        test for lag if needed
        the node is healthy
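
A minimal Python sketch of the pseudocode above, for illustration only. It assumes each member is a dict with `role`, `state`, `timeline` and `lag` keys (as in the kind of cluster JSON the check reads) and that the Patroni version is already available as an integer tuple. `is_healthy_replica` is a hypothetical name, not a function from check_patroni. The role filter is carried over from the PR description, and treating a non-integer lag as unhealthy is an assumption.

```python
from typing import Any, Dict, Optional, Tuple


def is_healthy_replica(
    member: Dict[str, Any],
    leader_timeline: int,
    patroni_version: Tuple[int, ...],
    max_lag: Optional[int] = None,
) -> bool:
    """Apply the proposed health rules to one cluster member (hypothetical helper)."""
    # Only replicas are candidates (role filter taken from the PR description).
    if member.get("role") not in ("replica", "sync_standby"):
        return False

    # Accepted states depend on the Patroni version.
    if patroni_version >= (3, 0, 4):
        healthy_states = ("streaming", "in archive recovery")
    else:
        healthy_states = ("running",)
    if member.get("state") not in healthy_states:
        return False

    # The timeline must match the leader's in both branches.
    if member.get("timeline") != leader_timeline:
        return False

    # Test the lag only if a threshold was requested.
    if max_lag is not None:
        lag = member.get("lag")
        # Assumption: a missing or non-integer lag counts as unhealthy.
        if not isinstance(lag, int) or lag > max_lag:
            return False

    return True
```

In this sketch, a standby in archive recovery with 2048 bytes of lag counts as healthy when no threshold is given and is rejected when `max_lag=1024`, so the lag check is what decides whether a catching-up node is acceptable.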

I don't know what to do about nodes with a nofailover tag. Maybe exclude them if we use a new --exclude-nofailover-tag option?

mbanck commented 11 months ago

I guess "in archive recovery" means the standby is currently catching up; whether that is healthy or not could then be checked via lag. So I think the above is fine.

I am also not sure what to do about nofailover tags, but in my opinion, this is orthogonal to whether a node is healthy or not.

blogh commented 11 months ago

@dlax could you have another look, please?

dalibo / check_patroni

cluster_has_replica: fix the way a healthy replica is detected #54