nobuto-m opened this issue 1 month ago
Hi @nobuto-m , thank you for the well prepared bug report!
After a detailed investigation:
1) The charm getting stuck is a known issue (duplicate of https://github.com/canonical/postgresql-operator/issues/418 => https://warthogs.atlassian.net/browse/DPE-3684); we should continue the discussion there. TL;DR: the pySyncObj Raft implementation is not fixable. We tried to work around this here, with no luck so far, and are exploring other options right now. In general, a Raft quorum only works with 3+ nodes.
2) Expected behavior: the cluster should stop accepting write requests to PostgreSQL, since this is a quorum-loss event. However, a valid replica remains on the surviving node (1 of 3), so the charm should be able to recover the cluster from it.
The initial idea was to elect a new primary, continue writing there, and have all nodes rejoin the cluster.
That failed due to 1) above: a new primary is not elected. This will be addressed in DPE-3684.
The `stop accepting write requests` part should be performed by Patroni once the quorum-loss event is noticed; that didn't happen because of the hang in pySyncObj. Once we fix or replace the library, the behavior should match your expectation.
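To illustrate why "Raft quorum only works with 3+ nodes," here is a minimal sketch of Raft majority-quorum arithmetic. This is illustrative only; it is not the pySyncObj implementation, and the function names are my own.

```python
# Illustrative Raft quorum arithmetic (not pySyncObj code).
def quorum_size(n: int) -> int:
    """Votes needed for a Raft majority in an n-node cluster."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Nodes that may fail while the cluster still holds quorum."""
    return n - quorum_size(n)

# A 3-node cluster needs 2 votes and survives 1 failure;
# losing 2 of 3 nodes (as in this report) is a quorum loss.
assert quorum_size(3) == 2
assert tolerated_failures(3) == 1
assert tolerated_failures(2) == 0  # a 2-node cluster tolerates no failure
```

This is why losing both the leader and the sync standby in a 3-node deployment necessarily drops the cluster below quorum.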
Steps to reproduce
```
juju deploy postgresql --base ubuntu@22.04 --channel 14/stable -n 3
```
Expected behavior
The cluster should stop accepting write requests to PostgreSQL, since this is a quorum-loss event. However, a valid replica remains on the surviving node (1 of 3), so the charm should be able to recover the cluster from it.
Actual behavior
The charm gets stuck at `waiting for primary to be reachable from this unit` and `awaiting for member to start`. Also, the Patroni configuration hasn't been recovered to a functional state.
initial status
after taking down the Leader and Sync Standby
-> the quorum loss is expected here.
cleanup of dead nodes
-> `remove-machine --force` was used on purpose, since `remove-unit` is a no-op when the agent is not reachable.
after cleanup
-> status looks okay except for the fact that there is no "Primary" line
-> Patroni is still not working
-> there are leftovers of the dead units' configuration.
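The leftover entries can be made visible by comparing Patroni's member list against the units juju still knows about. The sketch below assumes JSON in the shape produced by `patronictl list -f json`; the member names and addresses are hypothetical, not taken from the attached logs.

```python
# Hypothetical sketch: spot stale Patroni members after `remove-machine --force`.
# The JSON shape mimics `patronictl list -f json`; names/IPs are made up.
import json

patronictl_output = """[
  {"Member": "postgresql-0", "Host": "10.0.0.10", "Role": "Replica", "State": "running"},
  {"Member": "postgresql-1", "Host": "10.0.0.11", "Role": "Leader", "State": "stopped"},
  {"Member": "postgresql-2", "Host": "10.0.0.12", "Role": "Sync Standby", "State": "stopped"}
]"""

live_juju_units = {"postgresql-0"}  # only the surviving unit remains in juju

stale = [m["Member"] for m in json.loads(patronictl_output)
         if m["Member"] not in live_juju_units]
print(stale)  # members that the forced removal left behind in Patroni's config
```

A cleanup step along these lines (removing `stale` entries from the Patroni topology) is what the reporter expected the charm to perform automatically.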
adding two nodes to form the 3-node cluster again
after adding two nodes
-> juju status doesn't settle.
-> Patroni hasn't been recovered
-> Patroni config still has leftovers: it holds a 5-node cluster config instead of a 3-node one.
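The 5-vs-3 mismatch reported above suggests a simple invariant the charm could check after scaling: the Patroni member count should equal the juju unit count. A minimal sketch, with hypothetical unit names:

```python
# Hypothetical consistency check after re-adding two units (names are illustrative).
juju_units = ["postgresql-0", "postgresql-3", "postgresql-4"]
patroni_members = ["postgresql-0", "postgresql-1", "postgresql-2",
                   "postgresql-3", "postgresql-4"]  # the observed 5-node leftover config

consistent = len(patroni_members) == len(juju_units)
extra = len(patroni_members) - len(juju_units)
print(consistent, extra)  # False, 2 extra members
```

Until the two dead members are purged, Patroni keeps sizing its quorum for five nodes, which matches the observation that juju status never settles.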
Versions
Operating system: jammy
Juju CLI: 3.5.3-genericlinux-amd64
Juju agent: 3.5.3
Charm revision: 14/stable 429
LXD: N/A
Log output
Juju debug log: 3-node-recovery_debug.log
Additional context