EnterpriseDB / repmgr

A lightweight replication manager for PostgreSQL (Postgres)
https://repmgr.org/

I encountered an issue where failover did not occur after the primary database experienced a network disruption #871

Open xiaojing413332 opened 3 days ago

xiaojing413332 commented 3 days ago

Failover did not occur after the primary database experienced a network disruption. In a three-node repmgr setup, when network access to the primary PostgreSQL node is blocked, repmgr on the standby nodes fails to detect the primary node’s failure. On all three servers, the status shows “unreachable,” but no failover actually takes place.

The standby nodes recorded no logs, but the output of repmgr cluster show reveals the problem:

 ID | Name               | Role    | Status    | Upstream         | Location | Priority | Timeline | Connection string
----+--------------------+---------+-----------+------------------+----------+----------+----------+-------------------------------------------------------------------------
 1  | pg_49536432_stage  | standby |   running | ? pg_19850_stage | default  | 100      | 8        | host=xxxxxx port=6432 user=repmgr dbname=repmgr connect_timeout=2
 2  | pg_211296432_stage | standby |   running | ? pg_19850_stage | default  | 100      | 8        | host=xxxxxxx port=6432 user=repmgr dbname=repmgr connect_timeout=2
 3  | pg_19850_stage     | primary | ? running | ?                | default  | 100      |          | host=xxxxxxxx port=6432 user=repmgr dbname=repmgr connect_timeout=2

The repmgr.conf configuration details are as follows:

failover='automatic'
priority=100
connection_check_type=query
connection_check_query='SELECT 1'
reconnect_attempts=6
reconnect_interval=5
monitor_interval_secs=2
primary_notification_timeout=20
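For anyone triaging this, a rough sanity check on the settings above may help. The sketch below parses the posted key=value pairs and estimates how long a standby would keep retrying the primary before it could consider failover. It assumes repmgrd makes reconnect_attempts attempts spaced reconnect_interval seconds apart, which is my reading of the repmgr documentation and not something confirmed in this report. The helper names parse_conf and reconnect_window_secs are hypothetical and are not part of repmgr.

```python
# Settings copied from the issue, one per line.
REPMGR_CONF = """
failover='automatic'
priority=100
connection_check_type=query
connection_check_query = 'SELECT 1'
reconnect_attempts=6
reconnect_interval=5
monitor_interval_secs=2
primary_notification_timeout=20
"""


def parse_conf(text):
    """Parse simple key=value lines into a dict (hypothetical helper).

    Skips blank lines and lines starting with '#', and strips
    surrounding single or double quotes from values.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip("'\"")
    return settings


def reconnect_window_secs(settings):
    """Approximate retry window before failover could start.

    Assumption: attempts * interval; per-attempt connection
    timeouts are ignored.
    """
    attempts = int(settings["reconnect_attempts"])
    interval = int(settings["reconnect_interval"])
    return attempts * interval


settings = parse_conf(REPMGR_CONF)
print(reconnect_window_secs(settings))  # 6 * 5 = 30
```

Under that assumption, a standby would be expected to log reconnection attempts for roughly 30 seconds after losing the primary. The complete absence of log entries on the standbys therefore suggests checking whether repmgrd is actually running and monitoring on those nodes.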