Hi - the chosen defaults are "just" what Patroni brings along -> see https://patroni.readthedocs.io/en/latest/SETTINGS.html
Regarding your settings, they are somewhat mixed up: patroni.leaderLeaseDurationSeconds is mapped to ttl, and patroni.syncPeriodSeconds is mapped to loop_wait. Also, leaderLeaseDurationSeconds / ttl should be larger than syncPeriodSeconds / loop_wait - the settings somehow violate this. They should even be kept consistent with each other: ttl > loop_wait + 2 * retry_timeout.
My assumption for more stability is that raising ttl should be enough to deal with non-HA DCS setups. Patroni would then just get more chances to retry before the master is demoted. It would be great to hear what the Crunchy Data experiences are in these situations.
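As an illustration only - the numbers below are assumptions chosen to satisfy the constraint above rather than recommendations, and unrelated required fields are omitted - a PostgresCluster spec that raises ttl through the operator settings could look like this:

```yaml
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: hippo                        # hypothetical cluster name
spec:
  # postgresVersion, instances, backups, etc. omitted for brevity
  patroni:
    leaderLeaseDurationSeconds: 60   # maps to Patroni's ttl
    syncPeriodSeconds: 10            # maps to Patroni's loop_wait
  # With Patroni's default retry_timeout of 10s this keeps
  # ttl (60) > loop_wait (10) + 2 * retry_timeout (20).
```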
@jaredkipe I am facing the same issue, can you share the config that fixed this for you?
@daadu yeah, it's the block of Patroni config. I wouldn't say it is solved so much as delayed, but it has more or less solved it for us.
I think this will be improved a lot with Patroni 3.x and the DCS fail_safe configuration: see https://github.com/zalando/patroni/pull/2379 and https://github.com/zalando/patroni/blob/master/docs/releases.rst#version-300
As noted above, this has been addressed with the "failsafe" functionality that is now available in Patroni:
https://patroni.readthedocs.io/en/master/dcs_failsafe_mode.html
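A minimal sketch of turning this on, assuming failsafe_mode can simply be passed through the operator's spec.patroni.dynamicConfiguration and the image ships Patroni 3.0 or later:

```yaml
spec:
  patroni:
    dynamicConfiguration:
      # Patroni 3.0+: the leader keeps running while the DCS is unreachable,
      # provided it can still reach the other members over their REST API.
      failsafe_mode: true
```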
And considering that the latest versions of Crunchy Postgres for Kubernetes also include this change, I am proceeding with closing this issue.
Overview
Currently, the 10s timeout gets broken down into one ~5s timeout request and two ~2s timeout requests (I believe this is retry_timeout on https://patroni.readthedocs.io/en/latest/SETTINGS.html). This will cause leader elections with even moderate etcd unavailability. I have observed this with standalone and HA Postgres clusters running on Linode LKE (none of them on HA LKE).
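For context, the relevant Patroni defaults from the settings page above are:

```yaml
ttl: 30            # leader lease; exposed by the operator as leaderLeaseDurationSeconds
loop_wait: 10      # seconds between HA loop runs; exposed as syncPeriodSeconds
retry_timeout: 10  # deadline for retrying DCS (and PostgreSQL) operations
```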
Use Case
Standalone clusters could stay up even with the DCS down.
Is this the correct way to influence this configuration?
Desired Behavior
I personally believe the defaults should be relaxed, but barring that I'd love documentation on how to tweak this and what would be reasonable for standalone clusters (standalone clusters really shouldn't be leader electing, but I understand why it happens...).
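Purely as a sketch of what "relaxed" might mean - the values are assumptions, not tested recommendations, and whether retry_timeout can be overridden through spec.patroni.dynamicConfiguration is itself an assumption:

```yaml
spec:
  patroni:
    leaderLeaseDurationSeconds: 120   # ttl: tolerate longer DCS outages before demotion
    syncPeriodSeconds: 15             # loop_wait
    dynamicConfiguration:
      retry_timeout: 30               # keeps ttl > loop_wait + 2 * retry_timeout (120 > 15 + 60)
```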
Environment
Tell us about your environment:
Please provide the following details:
Platform: Kubernetes, LKE
Platform Version: 1.21.12, 5.0.5
PGO Image Tag: ubi8-5.0.5-0
Postgres Version: 13
Storage: hostpath
Number of Postgres clusters: 1
Additional Information
HA Kubernetes would probably help or resolve this, but configuration guidance would go a long way.