EnterpriseDB / repmgr

A lightweight replication manager for PostgreSQL (Postgres)
https://repmgr.org/

Repmgr cannot perform failover in a cluster that is not in a healthy state. #855

Open · kwenzh opened 4 months ago

kwenzh commented 4 months ago

Problem Description:

I have deployed a repmgr-managed PostgreSQL cluster on Kubernetes with one primary and two replicas, referred to as A, B, and C; B is the primary node.

Scenario 1: I introduced probabilistic packet loss of about 50% on node B's network interface (a reproduction sketch follows below) and expected repmgr to elect a new primary from A and C. The tricky part is that because B is not completely down, when A or C performs do_standby_promote, the get_primary_connection_quiet check can happen to reach B and connect successfully, so the promotion fails.
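
To reproduce the packet loss, tc/netem can be used on node B. This is only a sketch; it assumes the pod's network interface is named eth0 and that the container has NET_ADMIN privileges:

  # inject ~50% probabilistic packet loss on node B's interface (eth0 is an assumption)
  tc qdisc add dev eth0 root netem loss 50%
  # remove the impairment afterwards
  tc qdisc del dev eth0 root netem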

Scenario 2: I brought B's network interface down entirely, making it completely unreachable. A new primary was then elected from A and C as expected. But when I restored B's network, the cluster was left in a split-brain state: B does not transition from primary back to replica unless PostgreSQL on B is restarted.
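
To confirm the split-brain, each node can be asked whether it still considers itself a primary; two nodes answering "f" (not in recovery) means two live primaries. A minimal sketch, with hostnames and the repmgr user taken from the logs below (the loop itself is illustrative):

  # a node answering "f" believes it is a primary; two "f" answers = split-brain
  for h in node-A node-B node-C; do
    echo -n "$h: "
    psql -h "$h" -U repmgr -d repmgr -Atc 'SELECT pg_is_in_recovery()'
  done

repmgr's documented way to reattach a former primary is "repmgr node rejoin", which, consistent with the behaviour described above, requires PostgreSQL on B to be shut down first (the conninfo pointing at node C assumes C won the election, as it does in the log below):

  # run on node B after stopping PostgreSQL there
  repmgr node rejoin -f /opt/bitnami/repmgr/conf/repmgr.conf \
    -d 'host=node-C user=repmgr dbname=repmgr' --force-rewind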

Log Information:


[2024-07-24 14:31:13] [DEBUG] connection check type is "ping"
[2024-07-24 14:31:15] [DEBUG] replication lag in bytes is: 0
[2024-07-24 14:31:17] [DEBUG] connection check type is "ping"
2024-07-24 14:31:18.033 CST [212] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
        server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
cp: cannot stat '/bitnami/postgresql/backup/wal-back/00000008.history': No such file or directory
cp: cannot stat '/bitnami/postgresql/backup/wal-back/000000070000000000000061': No such file or directory
2024-07-24 14:31:18.040 CST [203] LOG:  invalid record length at 0/6118D820: wanted 24, got 0
[2024-07-24 14:31:22] [WARNING] unable to ping "user=repmgr password=xxxx host=node-B dbname=repmgr port=5432 connect_timeout=5"
[2024-07-24 14:31:22] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2024-07-24 14:31:22] [INFO] executing notification command for event "repmgrd_upstream_disconnect"
[2024-07-24 14:31:22] [DETAIL] command is:
  /opt/bitnami/repmgr/events/router.sh 2049659808 repmgrd_upstream_disconnect 1 "2024-07-24 14:31:22+0800" "unable to connect to upstream node \"node-B-0\" (ID: 854020671)"
[2024-07-24 14:31:22] [WARNING] unable to connect to upstream node "node-B-0" (ID: 854020671)
[2024-07-24 14:31:22] [INFO] checking state of node "node-B-0" (ID: 854020671), 1 of 3 attempts
[2024-07-24 14:31:23] [NOTICE] node "node-B-0" (ID: 854020671) has recovered, reconnecting
[2024-07-24 14:31:28] [ERROR] connection to database failed
[2024-07-24 14:31:28] [DETAIL] 
connection to server at "node-B", port 5432 failed: timeout expired
[2024-07-24 14:31:28] [NOTICE] unable to reconnect to node "node-B-0" (ID: 854020671)
[2024-07-24 14:31:28] [INFO] sleeping up to 5 seconds until next reconnection attempt
[2024-07-24 14:31:33] [INFO] checking state of node "node-B-0" (ID: 854020671), 2 of 3 attempts
[2024-07-24 14:31:36] [WARNING] unable to ping "user=repmgr password=xxxx connect_timeout=5 dbname=repmgr host=node-B port=5432 fallback_application_name=repmgr options=-csearch_path="
[2024-07-24 14:31:36] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2024-07-24 14:31:36] [INFO] sleeping up to 5 seconds until next reconnection attempt
[2024-07-24 14:31:41] [INFO] checking state of node "node-B-0" (ID: 854020671), 3 of 3 attempts
[2024-07-24 14:31:42] [WARNING] unable to ping "user=repmgr password=xxxx connect_timeout=5 dbname=repmgr host=node-B port=5432 fallback_application_name=repmgr options=-csearch_path="
[2024-07-24 14:31:42] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2024-07-24 14:31:42] [WARNING] unable to reconnect to node "node-B-0" (ID: 854020671) after 3 attempts
[2024-07-24 14:31:42] [NOTICE] setting "wal_retrieve_retry_interval" to 86405000 milliseconds
2024-07-24 14:31:42.388 CST [200] LOG:  received SIGHUP, reloading configuration files
[2024-07-24 14:31:42] [INFO] sleeping 5 seconds
2024-07-24 14:31:42.388 CST [200] LOG:  parameter "wal_retrieve_retry_interval" changed to "86405000"
[2024-07-24 14:31:47] [NOTICE] killing WAL receiver with PID 10388
2024-07-24 14:31:47.388 CST [10388] FATAL:  terminating walreceiver process due to administrator command
cp: cannot stat '/bitnami/postgresql/backup/wal-back/00000008.history': No such file or directory
2024-07-24 14:31:47.392 CST [203] LOG:  waiting for WAL to become available at 0/6118D838
[2024-07-24 14:31:48] [INFO] WAL receiver with pid 10388 killed
[2024-07-24 14:31:49] [DEBUG] connecting to: "user=repmgr password=xxxx connect_timeout=5 dbname=repmgr host=node-A port=5432 fallback_application_name=repmgr options=-csearch_path="
[2024-07-24 14:31:49] [INFO] WAL receiver PID on node 1983988088 is 93062
[2024-07-24 14:31:49] [DEBUG] sleeping 1 of max 10 seconds ("sibling_nodes_disconnect_timeout")
[2024-07-24 14:31:50] [INFO] WAL receiver PID on node 1983988088 is 93062
[2024-07-24 14:31:50] [DEBUG] sleeping 2 of max 10 seconds ("sibling_nodes_disconnect_timeout")
[2024-07-24 14:31:51] [INFO] WAL receiver PID on node 1983988088 is 93062
[2024-07-24 14:31:51] [DEBUG] sleeping 3 of max 10 seconds ("sibling_nodes_disconnect_timeout")
[2024-07-24 14:31:52] [INFO] WAL receiver PID on node 1983988088 is 93062
[2024-07-24 14:31:52] [DEBUG] sleeping 4 of max 10 seconds ("sibling_nodes_disconnect_timeout")
[2024-07-24 14:31:53] [INFO] WAL receiver PID on node 1983988088 is 93062
[2024-07-24 14:31:53] [DEBUG] sleeping 5 of max 10 seconds ("sibling_nodes_disconnect_timeout")
[2024-07-24 14:31:54] [INFO] WAL receiver PID on node 1983988088 is 93062
[2024-07-24 14:31:54] [DEBUG] sleeping 6 of max 10 seconds ("sibling_nodes_disconnect_timeout")
[2024-07-24 14:31:55] [INFO] WAL receiver PID on node 1983988088 is 93062
[2024-07-24 14:31:55] [DEBUG] sleeping 7 of max 10 seconds ("sibling_nodes_disconnect_timeout")
[2024-07-24 14:31:56] [INFO] WAL receiver PID on node 1983988088 is 93214
[2024-07-24 14:31:56] [DEBUG] sleeping 8 of max 10 seconds ("sibling_nodes_disconnect_timeout")
[2024-07-24 14:31:57] [INFO] WAL receiver PID on node 1983988088 is 93214
[2024-07-24 14:31:57] [DEBUG] sleeping 9 of max 10 seconds ("sibling_nodes_disconnect_timeout")
[2024-07-24 14:31:58] [INFO] WAL receiver PID on node 1983988088 is 93214
[2024-07-24 14:31:58] [DEBUG] sleeping 10 of max 10 seconds ("sibling_nodes_disconnect_timeout")
[2024-07-24 14:31:59] [WARNING] WAL receiver still connected on at least one sibling node
[2024-07-24 14:31:59] [DEBUG] do_election(): electoral term is 7
[2024-07-24 14:31:59] [INFO] 1 active sibling nodes registered
[2024-07-24 14:31:59] [INFO] 3 total nodes registered
[2024-07-24 14:31:59] [INFO] primary node  "node-B-0" (ID: 854020671) and this node have the same location ("default")
[2024-07-24 14:31:59] [INFO] local node's last receive lsn: 0/6118D820
[2024-07-24 14:31:59] [INFO] checking state of sibling node "node-A-0" (ID: 1983988088)
[2024-07-24 14:31:59] [DEBUG] connecting to: "user=repmgr password=xxxx connect_timeout=5 dbname=repmgr host=node-A port=5432 fallback_application_name=repmgr options=-csearch_path="
[2024-07-24 14:31:59] [INFO] node "node-A-0" (ID: 1983988088) reports its upstream is node 854020671, last seen 4 second(s) ago
[2024-07-24 14:31:59] [INFO] standby node "node-A-0" (ID: 1983988088) last saw primary node 4 second(s) ago
[2024-07-24 14:31:59] [INFO] last receive LSN for sibling node "node-A-0" (ID: 1983988088) is: 0/6118D740
[2024-07-24 14:31:59] [INFO] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 4 seconds
[2024-07-24 14:31:59] [NOTICE] promotion candidate is "node-C-0" (ID: 2049659808)
[2024-07-24 14:31:59] [DEBUG] election result: WON
[2024-07-24 14:31:59] [NOTICE] setting "wal_retrieve_retry_interval" to 5000 ms
# the winning node now promotes itself
[2024-07-24 14:31:59] [NOTICE] this node is the winner, will now promote itself and inform other nodes
2024-07-24 14:31:59.438 CST [200] LOG:  received SIGHUP, reloading configuration files
[2024-07-24 14:31:59] [INFO] promote_command is:
  "PGPASSWORD=xxxx repmgr standby promote -f "/opt/bitnami/repmgr/conf/repmgr.conf" --log-to-file --verbose -F"
2024-07-24 14:31:59.439 CST [200] LOG:  parameter "wal_retrieve_retry_interval" changed to "5000"
cp: cannot stat '/bitnami/postgresql/backup/wal-back/000000070000000000000061': No such file or directory
[2024-07-24 14:31:59] [NOTICE] using provided configuration file "/opt/bitnami/repmgr/conf/repmgr.conf"
[2024-07-24 14:31:59] [DEBUG] connecting to: "user=repmgr password=xxxx connect_timeout=5 dbname=repmgr host=node-C port=5432 fallback_application_name=repmgr options=-csearch_path="
[2024-07-24 14:31:59] [DEBUG] set_config():
  SET synchronous_commit TO 'local'
[2024-07-24 14:31:59] [INFO] connected to standby, checking its state
[2024-07-24 14:31:59] [DEBUG] get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
[2024-07-24 14:31:59] [DEBUG] get_node_record():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name,  n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached   FROM repmgr.nodes n  WHERE n.node_id = 2049659808
[2024-07-24 14:31:59] [INFO] searching for primary node
[2024-07-24 14:31:59] [DEBUG] get_primary_connection():
  SELECT node_id, conninfo,          CASE WHEN type = 'primary' THEN 1 ELSE 2 END AS type_priority         FROM repmgr.nodes    WHERE active IS TRUE      AND type != 'witness' ORDER BY active DESC, type_priority, priority, node_id

[2024-07-24 14:31:59] [INFO] checking if node 854020671 is primary
[2024-07-24 14:31:59] [DEBUG] connecting to: "user=repmgr password=xxxx connect_timeout=5 dbname=repmgr host=node-B port=5432 fallback_application_name=repmgr options=-csearch_path="
2024-07-24 14:32:03.049 CST [10734] LOG:  started streaming WAL from primary at 0/61000000 on timeline 7
[2024-07-24 14:32:03] [DEBUG] set_config():
  SET synchronous_commit TO 'local'
[2024-07-24 14:32:03] [DEBUG] get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
[2024-07-24 14:32:03] [INFO] current primary node is 854020671

# why does repmgr check for an active primary again at this point?
[2024-07-24 14:32:03] [ERROR] this replication cluster already has an active primary server
[2024-07-24 14:32:03] [DEBUG] get_node_record():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name,  n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached   FROM repmgr.nodes n  WHERE n.node_id = 854020671
[2024-07-24 14:32:03] [DETAIL] current primary is "node-B-0" (ID: 854020671)
[2024-07-24 14:32:03] [DEBUG] result of promote_command: 8
[2024-07-24 14:32:03] [ERROR] promote command failed
[2024-07-24 14:32:03] [DETAIL] promote command exited with error code 8
[2024-07-24 14:32:03] [INFO] checking if original primary node has reappeared
[2024-07-24 14:32:03] [DEBUG] connecting to: "user=repmgr password=xxxx connect_timeout=5 dbname=repmgr host=node-B port=5432 fallback_application_name=repmgr options=-csearch_path="
[2024-07-24 14:32:06] [ERROR] _set_config(): unable to set "synchronous_commit"
[2024-07-24 14:32:06] [DETAIL] 
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

[2024-07-24 14:32:06] [DETAIL] query text is:
SET synchronous_commit TO 'local'
[2024-07-24 14:32:06] [DEBUG] connecting to: "user=repmgr password=xxxx connect_timeout=5 dbname=repmgr host=node-A port=5432 fallback_application_name=repmgr options=-csearch_path="
[2024-07-24 14:32:06] [DEBUG] connecting to: "user=repmgr password=xxxx connect_timeout=5 dbname=repmgr host=node-C port=5432 fallback_application_name=repmgr options=-csearch_path="
[2024-07-24 14:32:06] [INFO] executing notification command for event "repmgrd_promote_error"
[2024-07-24 14:32:06] [DETAIL] command is:
  /opt/bitnami/repmgr/events/router.sh 2049659808 repmgrd_promote_error 1 "2024-07-24 14:32:06+0800" ""
[2024-07-24 14:32:06] [DEBUG] monitoring upstream node 854020671 in degraded state for 0 seconds
[2024-07-24 14:32:06] [DEBUG] connection check type is "ping"
[2024-07-24 14:32:07] [NOTICE] upstream is available but upstream connection has gone away, resetting
[2024-07-24 14:32:07] [DEBUG] connecting to: "user=repmgr password=xxxx connect_timeout=5 dbname=repmgr host=node-B port=5432 fallback_application_name=repmgr options=-csearch_path="
[2024-07-24 14:32:12] [DEBUG] scanning 1 node records to detect new primary...
[2024-07-24 14:32:12] [DEBUG] connecting to: "user=repmgr password=xxxx connect_timeout=5 dbname=repmgr host=node-A port=5432 fallback_application_name=repmgr options=-csearch_path="
[2024-07-24 14:32:12] [DEBUG] replication lag in bytes is: 13776
[2024-07-24 14:32:15] [DEBUG] connection check type is "ping"
[2024-07-24 14:32:16] [NOTICE] upstream is available but upstream connection has gone away, resetting
[2024-07-24 14:32:16] [DEBUG] connecting to: "user=repmgr password=xxxx connect_timeout=5 dbname=repmgr host=node-B port=5432 fallback_application_name=repmgr options=-csearch_path="
2024-07-24 14:32:18.114 CST [10734] FATAL:  terminating walreceiver due to timeout
cp: cannot stat '/bitnami/postgresql/backup/wal-back/00000008.history': No such file or directory
cp: cannot stat '/bitnami/postgresql/backup/wal-back/000000070000000000000061': No such file or directory
[2024-07-24 14:32:18] [DEBUG] resetting paired connection
[2024-07-24 14:32:18] [NOTICE] degraded monitoring timeout (12 seconds) exceeded, terminating
[2024-07-24 14:32:18] [INFO] executing notification command for event "repmgrd_shutdown"
[2024-07-24 14:32:18] [DETAIL] command is:
  /opt/bitnami/repmgr/events/router.sh 2049659808 repmgrd_shutdown 1 "2024-07-24 14:32:18+0800" "degraded monitoring timeout (12 seconds) exceeded, terminating"
[2024-07-24 14:32:19] [INFO] repmgrd terminating...
2024-07-24 14:32:19.015 CST [217] LOG:  could not receive data from client: Connection reset by peer
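
For reference, the retry and timeout values visible in the log line up with repmgr.conf settings along these lines. This is a reconstruction inferred from repmgrd's output, not the actual file contents:

  failover=automatic
  reconnect_attempts=3                   # "checking state ... 1 of 3 attempts"
  reconnect_interval=5                   # "sleeping up to 5 seconds until next reconnection attempt"
  sibling_nodes_disconnect_timeout=10    # "sleeping 10 of max 10 seconds"
  degraded_monitoring_timeout=12         # "degraded monitoring timeout (12 seconds) exceeded"
  promote_command='PGPASSWORD=xxxx repmgr standby promote -f "/opt/bitnami/repmgr/conf/repmgr.conf" --log-to-file --verbose -F'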

Questions:

1. Why does repmgr check again for an active primary during standby promote, after the election has already been won? In Scenario 1, this get_primary_connection() re-check can intermittently reach the degraded primary, which aborts the promotion and leaves the cluster without a functioning primary.
2. In Scenario 2, when the old primary's network is restored, why does repmgr not demote it back to a replica automatically? As it stands, the split-brain persists until PostgreSQL on node B is manually restarted.