canonical / postgresql-operator

A Charmed Operator for running PostgreSQL on machines
https://charmhub.io/postgresql
Apache License 2.0

The charm allows a 2-node cluster but it's not functional after a failover #570

Open nobuto-m opened 1 month ago

nobuto-m commented 1 month ago

Steps to reproduce

  1. Prepare a MAAS provider
  2. Deploy the charm with 2 units by following https://charmhub.io/postgresql/docs/h-scale:
     juju deploy postgresql --base ubuntu@22.04 --channel 14/stable -n 2
  3. Take down the primary unit

Expected behavior

It's either:

Actual behavior

This is a similar topic to https://github.com/canonical/postgresql-operator/issues/566.

Juju status looks okay at a glance. However, the surviving unit doesn't report which unit is currently the primary.

$ juju status
Model     Controller            Cloud/Region       Version  SLA          Timestamp
postgres  mysunbeam-controller  mysunbeam/default  3.5.3    unsupported  12:17:40Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql  14.11    active    1/2  postgresql  14/stable  429  no       

Unit           Workload  Agent  Machine  Public address   Ports     Message
postgresql/0   unknown   lost   0        192.168.151.115  5432/tcp  agent lost, see 'juju show-status-log postgresql/0'
postgresql/1*  active    idle   1        192.168.151.116  5432/tcp  

Machine  State    Address          Inst id    Base          AZ       Message
0        down     192.168.151.115  machine-7  ubuntu@22.04  default  Deployed
1        started  192.168.151.116  machine-8  ubuntu@22.04  default  Deployed

Also, the get-primary action reports the dead unit as the primary, which shouldn't be the case.

$ juju run postgresql/leader get-primary
Running operation 3 with 1 task
  - task 4 on unit-postgresql-1

Waiting for task 4...
primary: postgresql/0

Patroni's member list cannot be fetched since the Raft quorum was lost.

$ juju ssh postgresql/1 -- sudo -u snap_daemon env PATRONI_LOG_LEVEL=DEBUG patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml list
2024-08-05 12:20:16,176 - DEBUG - Loading configuration from file /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml
2024-08-05 12:20:21,243 - INFO - waiting on raft
2024-08-05 12:20:26,243 - INFO - waiting on raft
2024-08-05 12:20:31,244 - INFO - waiting on raft
2024-08-05 12:20:36,244 - INFO - waiting on raft
2024-08-05 12:20:41,245 - INFO - waiting on raft
2024-08-05 12:20:46,245 - INFO - waiting on raft
2024-08-05 12:20:51,246 - INFO - waiting on raft
2024-08-05 12:20:56,247 - INFO - waiting on raft
^C
Aborted!
Connection to 192.168.151.116 closed.
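The hang on "waiting on raft" is what Raft quorum loss looks like: a cluster of N voting members needs a strict majority (floor(N/2) + 1) alive, so a 2-node cluster cannot survive losing either node. A minimal sketch of the arithmetic (plain Python, not Patroni code):

```python
# Minimal illustration of Raft majority quorum; not Patroni's actual
# implementation, just the majority rule it relies on.

def has_quorum(total_members: int, alive_members: int) -> bool:
    """True if the surviving members still form a Raft majority."""
    return alive_members >= total_members // 2 + 1

# The 2-node cluster from this issue: losing either node kills the
# quorum, so `patronictl list` blocks on "waiting on raft".
print(has_quorum(2, 1))  # False

# A third voting member (e.g. patroni_raft_controller as a witness)
# would tolerate the loss of one node.
print(has_quorum(3, 2))  # True
```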

On a side note, Raft support has been deprecated in upstream Patroni as of 3.0.0. https://patroni.readthedocs.io/en/latest/releases.html#version-3-0-0

Versions

Operating system: jammy

Juju CLI: 3.5.3

Juju agent: 3.5.3

Charm revision: 14/stable 429

LXD: N/A

Log output

Juju debug log: model_debug.log

Additional context

github-actions[bot] commented 1 month ago

https://warthogs.atlassian.net/browse/DPE-5042

delgod commented 1 month ago

On a side note, the raft support is deprecated in patroni upstream as of 3.0.0.

Yes, Raft is not supported upstream, but it is supported and maintained by our team for all our users (until some point in time).

nobuto-m commented 1 month ago

It looks like upstream assumes two PostgreSQL nodes plus one witness node. So my understanding is that running the cluster with only two nodes is not supported.

https://patroni.readthedocs.io/en/latest/yaml_configuration.html#raft-deprecated

Q: It is possible to run Patroni and PostgreSQL only on two nodes?

A: Yes, on the third node you can run patroni_raft_controller (without Patroni and PostgreSQL). In such a setup, one can temporarily lose one node without affecting the primary.
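For reference, the setup the upstream FAQ describes corresponds to a three-member `raft` section in each node's Patroni config, with the third address served by patroni_raft_controller alone. A hypothetical sketch (the third address is invented for illustration; this is not the charm's actual generated config):

```yaml
# patroni.yaml excerpt on one PostgreSQL node -- addresses are
# placeholders, not values produced by this charm
raft:
  data_dir: /var/lib/patroni/raft
  self_addr: 192.168.151.115:2222
  partner_addrs:
    - 192.168.151.116:2222
    - 192.168.151.117:2222  # witness running only patroni_raft_controller
```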