canonical / postgresql-operator

A Charmed Operator for running PostgreSQL on machines
https://charmhub.io/postgresql
Apache License 2.0

The charm cannot recover from a quorum loss event of 3-node cluster #571

Open nobuto-m opened 1 month ago

nobuto-m commented 1 month ago

Steps to reproduce

  1. Prepare a MAAS provider
  2. Deploy a 3-node cluster: juju deploy postgresql --base ubuntu@22.04 --channel 14/stable -n 3
  3. Take down the primary plus one more unit to simulate a quorum loss event by losing 2 out of 3 nodes (see the command sketch below)
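
A minimal command sketch of the reproduction. It assumes a MAAS provider is already configured; the maas power-off calls and the $PROFILE / $SYSTEM_ID placeholders are illustrative only, and any method that hard-stops two of the three machines works just as well.

$ juju deploy postgresql --base ubuntu@22.04 --channel 14/stable -n 3
$ juju status    # wait until all three units are active/idle and note which unit reports "Primary"
$ maas $PROFILE machine power-off $PRIMARY_SYSTEM_ID    # machine hosting the primary
$ maas $PROFILE machine power-off $SECOND_SYSTEM_ID     # one more cluster member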

Expected behavior

The cluster should stop accepting write requests to PostgreSQL since this is a quorum loss event. However, the surviving node still holds a valid replica, so the charm should be able to recover the cluster from that replica.

Actual behavior

The charm gets stuck at `waiting for primary to be reachable from this unit` and `awaiting for member to start`. Also, the Patroni configuration is not restored to a functional state.

initial status

$ juju status
Model     Controller            Cloud/Region       Version  SLA          Timestamp
postgres  mysunbeam-controller  mysunbeam/default  3.5.3    unsupported  13:25:53Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql  14.11    active      3  postgresql  14/stable  429  no

Unit           Workload  Agent  Machine  Public address   Ports     Message
postgresql/0*  active    idle   0        192.168.151.117  5432/tcp
postgresql/1   active    idle   1        192.168.151.118  5432/tcp  Primary
postgresql/2   active    idle   2        192.168.151.119  5432/tcp

Machine  State    Address          Inst id    Base          AZ       Message
0        started  192.168.151.117  machine-7  ubuntu@22.04  default  Deployed
1        started  192.168.151.118  machine-8  ubuntu@22.04  default  Deployed
2        started  192.168.151.119  machine-1  ubuntu@22.04  default  Deployed
$ sudo -u snap_daemon patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml topology
+ Cluster: postgresql (7399642793178039038) ------+-----------+----+-----------+
| Member         | Host            | Role         | State     | TL | Lag in MB |
+----------------+-----------------+--------------+-----------+----+-----------+
| postgresql-1   | 192.168.151.118 | Leader       | running   |  1 |           |
| + postgresql-0 | 192.168.151.117 | Sync Standby | streaming |  1 |         0 |
| + postgresql-2 | 192.168.151.119 | Replica      | streaming |  1 |         0 |
+----------------+-----------------+--------------+-----------+----+-----------+

after taking down the Leader and Sync Standby

$ juju status
Model     Controller            Cloud/Region       Version  SLA          Timestamp
postgres  mysunbeam-controller  mysunbeam/default  3.5.3    unsupported  13:39:18Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql  14.11    active    1/3  postgresql  14/stable  429  no       

Unit           Workload  Agent  Machine  Public address   Ports     Message
postgresql/0*  active    idle   0        192.168.151.117  5432/tcp  
postgresql/1   unknown   lost   1        192.168.151.118  5432/tcp  agent lost, see 'juju show-status-log postgresql/1'
postgresql/2   unknown   lost   2        192.168.151.119  5432/tcp  agent lost, see 'juju show-status-log postgresql/2'

Machine  State    Address          Inst id    Base          AZ       Message
0        started  192.168.151.117  machine-7  ubuntu@22.04  default  Deployed
1        down     192.168.151.118  machine-8  ubuntu@22.04  default  Deployed
2        down     192.168.151.119  machine-1  ubuntu@22.04  default  Deployed
$ sudo -u snap_daemon env PATRONI_LOG_LEVEL=DEBUG patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml list
2024-08-05 13:38:25,462 - DEBUG - Loading configuration from file /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml
2024-08-05 13:38:30,529 - INFO - waiting on raft
2024-08-05 13:38:35,530 - INFO - waiting on raft
2024-08-05 13:38:40,530 - INFO - waiting on raft
2024-08-05 13:38:45,531 - INFO - waiting on raft
2024-08-05 13:38:50,532 - INFO - waiting on raft
2024-08-05 13:38:55,532 - INFO - waiting on raft
2024-08-05 13:39:00,533 - INFO - waiting on raft
^C
Aborted!

-> the quorum loss is expected here.

cleanup of dead nodes

$ juju remove-machine --force 1
WARNING This command will perform the following actions:
will remove machine 1
- will remove unit postgresql/1
- will remove storage pgdata/1

Continue [y/N]? y

$ juju remove-machine --force 2
WARNING This command will perform the following actions:
will remove machine 2
- will remove unit postgresql/2
- will remove storage pgdata/2

Continue [y/N]? y

-> remove-machine --force was used on purpose since remove-unit is a no-op when the agent is not reachable.

after cleanup

$ juju status
Model     Controller            Cloud/Region       Version  SLA          Timestamp
postgres  mysunbeam-controller  mysunbeam/default  3.5.3    unsupported  13:42:14Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql  14.11    active      1  postgresql  14/stable  429  no       

Unit           Workload  Agent  Machine  Public address   Ports     Message
postgresql/0*  active    idle   0        192.168.151.117  5432/tcp  

Machine  State    Address          Inst id    Base          AZ       Message
0        started  192.168.151.117  machine-7  ubuntu@22.04  default  Deployed

-> the status looks okay except that no unit reports the "Primary" message
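
One way to double-check from the database side would be to ask PostgreSQL whether the surviving node is accepting writes. This is only a sketch: the charmed-postgresql.psql wrapper name and the get-password action/username are assumptions about this deployment and may need adjusting.

$ juju run postgresql/leader get-password username=operator    # assumed action name and parameter
$ sudo -u snap_daemon charmed-postgresql.psql -h 127.0.0.1 -U operator -d postgres -c 'SELECT pg_is_in_recovery();'

pg_is_in_recovery() returning t (or the connection failing outright) would confirm the node is not serving writes.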

machine-7:~$ sudo -u snap_daemon env PATRONI_LOG_LEVEL=DEBUG patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml list
2024-08-05 13:43:18,378 - DEBUG - Loading configuration from file /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml
2024-08-05 13:43:23,445 - INFO - waiting on raft
2024-08-05 13:43:28,446 - INFO - waiting on raft
2024-08-05 13:43:33,446 - INFO - waiting on raft
^C
Aborted!

-> Patroni is still not working

$ sudo cat /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml

...

raft:
  data_dir: /var/snap/charmed-postgresql/current/etc/patroni/raft
  self_addr: '192.168.151.117:2222'
  partner_addrs:
  - 192.168.151.118:2222
  - 192.168.151.119:2222

...

  pg_hba:
    - local all backup peer map=operator
    - local all operator scram-sha-256
    - local all monitoring password
    - host replication replication 127.0.0.1/32 md5
    - host all all 0.0.0.0/0 md5
    # Allow replications connections from other cluster members.
    - host     replication    replication    192.168.151.118/0    md5

    - host     replication    replication    192.168.151.119/0    md5

...

-> there are leftovers of the dead units' configuration.
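
For comparison, a cleaned-up raft section for the current single-node state would reference only the surviving member, roughly as below. This is just a sketch of what the configuration should converge to, not a documented manual recovery procedure; whether hand-editing the file (and clearing the stale journal under data_dir) actually unblocks pySyncObj has not been verified here.

raft:
  data_dir: /var/snap/charmed-postgresql/current/etc/patroni/raft
  self_addr: '192.168.151.117:2222'
  partner_addrs: []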

adding two nodes to form the 3-node cluster again

$ juju add-unit postgresql -n 2

after adding two nodes

$ juju status
Model     Controller            Cloud/Region       Version  SLA          Timestamp
postgres  mysunbeam-controller  mysunbeam/default  3.5.3    unsupported  13:57:16Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql  14.11    active      3  postgresql  14/stable  429  no       

Unit           Workload  Agent  Machine  Public address   Ports     Message
postgresql/0*  waiting   idle   0        192.168.151.117  5432/tcp  waiting for primary to be reachable from this unit
postgresql/3   waiting   idle   3        192.168.151.120  5432/tcp  awaiting for member to start
postgresql/4   waiting   idle   4        192.168.151.121  5432/tcp  awaiting for member to start

Machine  State    Address          Inst id    Base          AZ       Message
0        started  192.168.151.117  machine-7  ubuntu@22.04  default  Deployed
3        started  192.168.151.120  machine-8  ubuntu@22.04  default  Deployed
4        started  192.168.151.121  machine-1  ubuntu@22.04  default  Deployed

-> juju status doesn't settle.

$ sudo -u snap_daemon env PATRONI_LOG_LEVEL=DEBUG patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml list
2024-08-05 13:55:31,623 - DEBUG - Loading configuration from file /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml
2024-08-05 13:55:38,696 - INFO - waiting on raft
2024-08-05 13:55:43,696 - INFO - waiting on raft
2024-08-05 13:55:48,697 - INFO - waiting on raft
2024-08-05 13:55:53,697 - INFO - waiting on raft
2024-08-05 13:55:58,698 - INFO - waiting on raft
^C
Aborted!

-> Patroni still hasn't recovered

$ sudo cat /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml

...

raft:
  data_dir: /var/snap/charmed-postgresql/current/etc/patroni/raft
  self_addr: '192.168.151.117:2222'
  partner_addrs:
  - 192.168.151.119:2222
  - 192.168.151.121:2222
  - 192.168.151.118:2222
  - 192.168.151.120:2222

...

  pg_hba:
    - local all backup peer map=operator
    - local all operator scram-sha-256
    - local all monitoring password
    - host replication replication 127.0.0.1/32 md5
    - host all all 0.0.0.0/0 md5
    # Allow replications connections from other cluster members.
    - host     replication    replication    192.168.151.119/0    md5

    - host     replication    replication    192.168.151.121/0    md5

    - host     replication    replication    192.168.151.118/0    md5

    - host     replication    replication    192.168.151.120/0    md5

-> the Patroni config still has leftovers: it describes a 5-node cluster instead of the current 3-node cluster.
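
For reference, the state the charm should converge to would reference only the current members (192.168.151.117, .120 and .121), roughly as below. This is a sketch of the expected end state, not output from an actual run.

raft:
  data_dir: /var/snap/charmed-postgresql/current/etc/patroni/raft
  self_addr: '192.168.151.117:2222'
  partner_addrs:
  - 192.168.151.120:2222
  - 192.168.151.121:2222

  pg_hba:
    ...
    - host     replication    replication    192.168.151.120/0    md5
    - host     replication    replication    192.168.151.121/0    md5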

Versions

Operating system: jammy

Juju CLI: 3.5.3-genericlinux-amd64

Juju agent: 3.5.3

Charm revision: 14/stable 429

LXD: N/A

Log output

Juju debug log: 3-node-recovery_debug.log

Additional context

github-actions[bot] commented 1 month ago

https://warthogs.atlassian.net/browse/DPE-5045

taurus-forever commented 4 weeks ago

Hi @nobuto-m, thank you for the well-prepared bug report!

After the detailed investigation:

1) The charm getting stuck is a known issue (duplicate of https://github.com/canonical/postgresql-operator/issues/418 => https://warthogs.atlassian.net/browse/DPE-3684); we should continue the discussion there. TL;DR: the pySyncObj Raft implementation is not fixable. We tried to work around this here, with no luck so far, and are exploring other options right now. In general, a Raft quorum only works with 3+ nodes: a majority of the configured members must be reachable, so a single survivor out of 3 can never elect a leader.

2) Expected behavior: "The cluster should stop accepting write requests to PostgreSQL since this is a quorum loss event. However, the surviving node still holds a valid replica, so the charm should be able to recover the cluster from that replica."

The initial idea was to elect a new primary, continue writing there, and have all nodes rejoin the cluster.

It failed due to 1) above: a new primary is not elected. This will be addressed in DPE-3684.

The `stop accepting a write request` part should be performed by Patroni once a quorum loss event is noticed; it didn't happen because pySyncObj got stuck. Once we fix/replace the library, the behavior should match your expectation.
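
As a side note, the Raft layer can be inspected directly with pysyncobj's admin tool, assuming syncobj_admin (shipped with pysyncobj) is reachable from the snap's Python environment, which may not be the case here:

$ syncobj_admin -conn 192.168.151.117:2222 -status

On a healthy cluster this prints the leader and partner node state; on the stuck single survivor it is expected to hang or time out, matching the `waiting on raft` loops above.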