EnterpriseDB / repmgr

A lightweight replication manager for PostgreSQL (Postgres)
https://repmgr.org/

master node fails to automatically rejoin the cluster after recovery from failure #850

Open nuowei2543 opened 6 months ago

nuowei2543 commented 6 months ago

Hello, while simulating a host failover, I stopped the PostgreSQL instance on the master host, and the standby node was successfully promoted to become the new master. However, when I restarted the original master node, it did not automatically rejoin the cluster as a standby node.

Versions: Ubuntu 20.04, PostgreSQL 16.2, repmgrd 5.4.1

1. postgres@ser-compute-01:/disk1/postgresql/repmgr$ repmgr -f /disk1/postgresql/repmgr/repmgr.conf cluster show

 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
 1  | node1 | primary |   running |          | default  | 100      | 3        | host=10.0.14.100 port=5432 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running | node1    | default  | 100      | 3        | host=10.0.14.101 port=5432 user=repmgr dbname=repmgr connect_timeout=2
 3  | node3 | witness |   running | node1    | default  | 0        | n/a      | host=10.0.14.109 port=5432 user=repmgr dbname=repmgr connect_timeout=2

2. On node1, run: supervisorctl stop postgresql

3. postgres@ser-compute-02:~$ repmgr -f /disk1/postgresql/repmgr/repmgr.conf cluster show

 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
 1  | node1 | primary | - failed  | ?        | default  | 100      |          | host=10.0.14.100 port=5432 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary |   running |          | default  | 100      | 2        | host=10.0.14.101 port=5432 user=repmgr dbname=repmgr connect_timeout=2
 3  | node3 | witness |   running | node2    | default  | 0        | n/a      | host=10.0.14.109 port=5432 user=repmgr dbname=repmgr connect_timeout=2

4. On node1, run: supervisorctl start postgresql

5. postgres@ser-compute-02:/disk1/postgresql/repmgr$ repmgr -f /disk1/postgresql/repmgr/repmgr.conf cluster show

 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
 1  | node1 | primary | ! running |          | default  | 100      | 1        | host=10.0.14.100 port=5432 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary |   running |          | default  | 100      | 2        | host=10.0.14.101 port=5432 user=repmgr dbname=repmgr connect_timeout=2
 3  | node3 | witness |   running | node2    | default  | 0        | n/a      | host=10.0.14.109 port=5432 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected

So, I don't know why node1 is still the primary.

stephan-hahn commented 6 months ago

Hi, repmgr has no built-in automatic rejoin. By simply starting the old master again, you create a split-brain scenario in which both nodes run as primary. However, it is no problem to have a script rejoin the old master automatically once the new one has been promoted.
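
As a rough sketch of what such a script might look like (it was not posted in this thread): repmgr provides the `repmgr node rejoin` command, which reattaches a stopped former primary as a standby, and its `--force-rewind` option runs pg_rewind when the timelines have diverged. The config path and connection string below are copied from the cluster output in the question. The function name `rejoin_old_primary` and the `REPMGR_BIN` override are hypothetical, added only for illustration. The sketch assumes PostgreSQL on node1 has been shut down cleanly and that pg_rewind's prerequisites (`wal_log_hints = on` or data checksums) are met.

```shell
#!/bin/sh
# Sketch only: rejoin a former primary (node1) as a standby of the new primary.
# Assumptions: paths and conninfo mirror the cluster shown in the question;
# REPMGR_BIN and rejoin_old_primary are illustrative names, not part of repmgr.

REPMGR_BIN="${REPMGR_BIN:-repmgr}"
REPMGR_CONF="${REPMGR_CONF:-/disk1/postgresql/repmgr/repmgr.conf}"
# Connection string of the node that is primary now (node2 in the thread).
NEW_PRIMARY_CONNINFO="${NEW_PRIMARY_CONNINFO:-host=10.0.14.101 port=5432 user=repmgr dbname=repmgr connect_timeout=2}"

# Run "repmgr node rejoin" against the new primary.
# Extra arguments (for example --dry-run) are passed through unchanged.
rejoin_old_primary() {
    "$REPMGR_BIN" -f "$REPMGR_CONF" node rejoin \
        -d "$NEW_PRIMARY_CONNINFO" --force-rewind --verbose "$@"
}
```

On node1, after `supervisorctl stop postgresql`, one would typically call `rejoin_old_primary --dry-run` first to check the prerequisites, and then `rejoin_old_primary` to perform the rejoin. Whether to trigger this unattended is a judgment call, since a rewind discards any changes the old primary made after it diverged from the new one.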