EnterpriseDB / repmgr

A lightweight replication manager for PostgreSQL (Postgres)
https://repmgr.org/

'Standby follow' problem when old primary comes back after promotion of a new primary #520

Open MannerMan opened 5 years ago

MannerMan commented 5 years ago

Hello!

First of all, I want to give my thanks for providing the repmgr suite for managing postgres replication and failover scenarios; we have used it with great success in my organization.

However, I would like to describe an incident that occurred in our production environment a few days ago, where our repmgrd failover setup could not perform a recovery due to some unfortunate circumstances. I will begin by describing our setup:

We have one primary database, with two attached standby replicas. All servers run repmgr(d) 4.1.1 and postgres 10.5. Our traffic is sent to the database through a pgbouncer node in front of the cluster. Repmgr is configured with a custom promote_command, which triggers a wrapper script that routes traffic in pgbouncer to the node being promoted and then performs the actual promotion. Since we use repmgrd, the remaining standby is automatically instructed to follow the newly promoted primary. Should promotion fail (i.e. "repmgr standby promote" exits with a non-zero status), the wrapper script will re-route the traffic in pgbouncer back to the original primary.
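For illustration, a minimal sketch of what such a wrapper can look like (this is not our actual script; the config path, the old-primary host and the reroute_pgbouncer helper are placeholders):

#!/bin/bash
# Hypothetical promote wrapper, invoked via repmgr's promote_command.
# reroute_pgbouncer stands in for whatever mechanism updates the pgbouncer
# configuration so that traffic is sent to the given host.

set -u

NEW_PRIMARY=$(hostname)
OLD_PRIMARY="db01.example.internal"          # placeholder
REPMGR_CONF="/etc/repmgr/10/repmgr.conf"     # placeholder path

# Route traffic to the node being promoted first ...
reroute_pgbouncer "$NEW_PRIMARY"

# ... then perform the actual promotion.
if ! repmgr -f "$REPMGR_CONF" standby promote; then
    # Promotion failed: send traffic back to the original primary.
    reroute_pgbouncer "$OLD_PRIMARY"
    exit 1
fi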

This setup has worked very well for us, both in testing automatic failover and in doing real production switchovers (although manual ones, through a script very similar to the one used as promote_command).

But of course, when things go wrong in the real world, they tend not to go quite as expected :) And this was the case for us this weekend, when a virtualization host crashed due to a memory error. One of the virtual machines running on this host was the primary database server for one of our repmgrd clusters. Now, the automatic re-routing of traffic and promotion of a new standby did work flawlessly, but due to another H/A mechanism in the virtualization layer, the old primary came back online just as the new primary had been promoted, but before the other standby had been instructed to follow it. This seems to have caused the other standby to abort its attempt to follow the new primary, since it could now reach the old primary again.

repmgr.log from this standby:

[2019-01-07 01:11:42] [INFO] checking state of node 1, 11 of 12 attempts
[2019-01-07 01:11:44] [INFO] sleeping 5 seconds until next reconnection attempt
[2019-01-07 01:11:49] [INFO] checking state of node 1, 12 of 12 attempts
[2019-01-07 01:11:51] [WARNING] unable to reconnect to node 1 after 12 attempts
[2019-01-07 01:11:51] [INFO] follower node awaiting notification from the candidate node
[2019-01-07 01:12:07] [NOTICE] redirecting logging output to "/var/log/repmgr/repmgrd.log"
[2019-01-07 01:12:07] [NOTICE] setting node 3's primary to node 1
[2019-01-07 01:12:07] [NOTICE] restarting server using "sudo systemctl restart postgresql-10"
[2019-01-07 01:12:09] [NOTICE] STANDBY FOLLOW successful
[2019-01-07 01:12:09] [DETAIL] standby attached to upstream node "prod_pgc1_db01" (node ID: 1)
[2019-01-07 01:12:09] [NOTICE] node 3 now following new upstream node 2
[2019-01-07 01:12:09] [INFO] resuming standby monitoring mode
[2019-01-07 01:12:09] [DETAIL] following new primary "prod_pgc1_db02" (node id: 2)
[2019-01-07 01:13:10] [INFO] node "prod_pgc1_db03" (node ID: 3) monitoring upstream node "prod_pgc1_db01" (node ID: 1) in normal state

We do use synchronous replication, with the "synchronous_standby_names ANY 1 (..)" option on all nodes, to ensure that the node acting as primary always has at least one replica attached before it can accept writes. We use this essentially as split-brain protection, since it ensures that at least two of the three nodes in the cluster are in agreement in order to accept writes. In this incident, that protection did work as intended and prevented the lone standby that was promoted from accepting writes. But since the "repmgr standby promote" command succeeded, traffic was routed to the new (effectively read-only) primary, causing some service disruption.
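For reference, the relevant setting looks roughly like this in postgresql.conf on each node (the member names are assumed to match the application_name each standby connects with, taken here from the node names in the logs):

synchronous_commit = on
synchronous_standby_names = 'ANY 1 (prod_pgc1_db01, prod_pgc1_db02, prod_pgc1_db03)'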

It seems we were very unlucky here: had the old primary come back a few seconds sooner, promotion would have failed and the service would have come back online. Had it come back a few seconds later than it did, the standby node would have followed the new primary and the service would have come back online. Murphy's law, I guess :)

I'm now looking into ways to mitigate this specific case, i.e. where promotion of a new primary succeeds, but the old primary comes back online before the other standby has begun to follow the new primary. It's probably possible to engineer something into our promotion wrapper script that verifies, with "repmgr cluster show" or similar, that the other standby is following the new primary (sketched below). I would like to avoid putting too much logic/complexity in this script, though. For our use case, it would be very useful if one could pass an option to "repmgr standby promote" telling it not to exit with 0 unless the other standby(s) begin to follow the new primary. That would trigger our existing mitigation and re-route traffic back to the old primary. "repmgr standby switchover" appears to have such an option with '--siblings-follow' (which we use for manual switchovers).
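A rough sketch of that check, run on the new primary after "repmgr standby promote" returns 0 (the sibling name, connection details and timeout are placeholders, and it queries pg_stat_replication directly rather than parsing "repmgr cluster show" output):

# Wait for the sibling standby to appear in pg_stat_replication on the
# newly promoted primary; if it never does, report failure so the wrapper
# re-routes traffic back to the old primary.
SIBLING="prod_pgc1_db03"            # placeholder application_name
DEADLINE=$((SECONDS + 60))          # placeholder timeout

while [ "$SECONDS" -lt "$DEADLINE" ]; do
    ATTACHED=$(psql -U repmgr -d repmgr -tAc \
        "SELECT count(*) FROM pg_stat_replication WHERE application_name = '${SIBLING}'")
    if [ "${ATTACHED:-0}" -ge 1 ]; then
        exit 0                      # sibling is following the new primary
    fi
    sleep 2
done

exit 1                              # sibling never attached; treat promotion as failed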

I am open to any suggestions for mitigating this specific case, however. Please let me know if any further information or logs are required!

ibarwick commented 5 years ago

Hi

Thanks for the detailed report; working through the backlog here.

First of all, I want to give my thanks for providing the repmgr suite for managing postgres replication and failover scenarios; we have used it with great success in my organization.

Much appreciated!

it would be very useful if one could pass an option to "repmgr standby promote" telling it not to exit with 0 unless the other standby(s) begin to follow the new primary.

This sounds potentially useful and implementable with reasonable effort; we'll look into it.

Could you provide the repmgr.conf files for each node (suitably anonymized if necessary)? That would help to rule out certain issues and/or reliably reproduce the situation.

Thanks!

MannerMan commented 5 years ago

Hi,

Thank you for taking the time to look into this issue!

Indeed, I have attached the repmgr.conf files from all three nodes to this comment. After the incident we increased "reconnect_attempts" from 12 to 24 (at a 5-second interval) to give the VMware-level H/A mechanism more time to resolve the situation and to decrease the chance of the same issue happening again. I had to anonymize our internal IP addresses in the "conninfo" option, but the repmgr configuration is otherwise unchanged. Please let me know if you require any further information or configuration files.

(I had to add a .txt extension to the filenames for GitHub to allow the upload.) repmgr_db01.conf.txt repmgr_db02.conf.txt repmgr_db03.conf.txt
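For context, the reconnection-related part of a standby's configuration now looks roughly like this (only reconnect_attempts and reconnect_interval reflect the actual change; the other values and paths are illustrative, not copied from the attached files, and the conninfo host is a placeholder):

node_id=3
node_name='prod_pgc1_db03'
conninfo='host=192.0.2.3 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/pgsql/10/data'
failover=automatic
promote_command='/path/to/promote_wrapper.sh'
follow_command='/usr/pgsql-10/bin/repmgr standby follow -f /etc/repmgr/10/repmgr.conf --upstream-node-id=%n'
reconnect_attempts=24
reconnect_interval=5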

MannerMan commented 5 years ago

We had another incident like this one yesterday. The primary server in our three-node setup stopped responding; promotion was triggered, but the primary came back and started responding again while the promotion was in progress. Logs from the standby that was awaiting the promotion notification to follow the new primary:

[2019-11-05 20:09:08] [WARNING] unable to reconnect to node 1 after 24 attempts
[2019-11-05 20:09:09] [INFO] follower node awaiting notification from a candidate node
[2019-11-05 20:09:24] [NOTICE] redirecting logging output to "/var/log/repmgr/repmgrd.log"
[2019-11-05 20:09:24] [NOTICE] setting node 3's primary to node 1
[2019-11-05 20:09:24] [NOTICE] restarting server using "sudo systemctl restart postgresql-10"
[2019-11-05 20:09:27] [NOTICE] STANDBY FOLLOW successful
[2019-11-05 20:09:27] [DETAIL] standby attached to upstream node "utv_pgc1_db01" (node ID: 1)
INFO:  set_repmgrd_pid(): provided pidfile is /var/run/repmgr/repmgrd-10.pid
[2019-11-05 20:09:27] [NOTICE] node 3 now following new upstream node 2
[2019-11-05 20:09:27] [INFO] resuming standby monitoring mode
[2019-11-05 20:09:27] [DETAIL] following new primary "utv_pgc1_db02" (node id: 2)
[2019-11-05 20:09:34] [WARNING] unable to connect to upstream node "utv_pgc1_db01" (node ID: 1)
[2019-11-05 20:09:34] [INFO] checking state of node 1, 1 of 24 attempts
[2019-11-05 20:09:34] [INFO] sleeping 5 seconds until next reconnection attempt
[2019-11-05 20:09:39] [INFO] checking state of node 1, 2 of 24 attempts
[2019-11-05 20:09:39] [NOTICE] node has recovered, reconnecting
[2019-11-05 20:09:39] [INFO] connection to node 1 succeeded
[2019-11-05 20:09:39] [INFO] original connection is still available
[2019-11-05 20:09:39] [NOTICE] reconnected to upstream node after 5 seconds
[2019-11-05 20:10:29] [INFO] node "utv_pgc1_db03" (node ID: 3) monitoring upstream node "utv_pgc1_db01" (node ID: 1) in normal state

As can be seen from the log, this node ended up still attached to the old primary (node 1, the one that failed) rather than the new primary (node 2). Our synchronous_standby_names configuration once again protected us from split brain, but with an outage as a result.

MannerMan commented 3 years ago

@ibarwick

We had a similar case of this failure happening again this weekend. We're still running repmgr 4.2, but I looked through the release notes and I don't think any more recent release would have prevented this; please correct me if I'm wrong!

Our primary server ran out of kernel memory; we're unsure why, but some application on the server appears to have caused it by spawning new threads indefinitely. This caused queries/connections on db01 to fail randomly:

2021-03-06 22:53:00.730 CET "30848" "1683" "" "" LOG:  could not fork new process for connection: Cannot allocate memory
2021-03-06 22:53:00.731 CET "30848" "1684" "" "" LOG:  could not fork new process for connection: Cannot allocate memory
2021-03-06 22:53:00.733 CET "30848" "1685" "" "" LOG:  could not fork new process for connection: Cannot allocate memory
2021-03-06 22:53:00.749 CET "24260" "1" "repmgr" "[unknown]" LOG:  could not receive data from client: Connection reset by peer
2021-03-06 22:53:00.749 CET "28411" "1" "repmgr" "repmgr" LOG:  could not receive data from client: Connection reset by peer
2021-03-06 22:53:34.183 CET "30848" "1686" "" "" LOG:  could not fork new process for connection: Cannot allocate memory
2021-03-06 22:53:34.183 CET "30848" "1687" "" "" LOG:  could not fork new process for connection: Cannot allocate memory
2021-03-06 22:53:36.194 CET "30848" "1688" "" "" LOG:  could not fork new process for connection: Cannot allocate memory

Our 'db03' node noticed the failure and was waiting for the candidate node (db02) to promote:

[2021-03-06 22:52:39] [INFO] sleeping 5 seconds until next reconnection attempt
[2021-03-06 22:52:44] [INFO] checking state of node 1, 23 of 24 attempts
[2021-03-06 22:52:44] [INFO] sleeping 5 seconds until next reconnection attempt
[2021-03-06 22:52:49] [INFO] checking state of node 1, 24 of 24 attempts
[2021-03-06 22:52:49] [WARNING] unable to reconnect to node 1 after 24 attempts
[2021-03-06 22:52:49] [INFO] follower node awaiting notification from a candidate node

Our db02 node promptly promoted itself to primary and reconfigured database traffic to go to it:

[2021-03-06 22:52:55] [NOTICE] promoting standby to primary
[2021-03-06 22:52:55] [DETAIL] promoting server "prod_pgc3_db02" (ID: 2) using "/usr/pgsql-10/bin/pg_ctl  -w -D '/var/lib/pgsql/10/data' promote"
[2021-03-06 22:52:55] [DETAIL] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2021-03-06 22:52:59] [NOTICE] STANDBY PROMOTE successful
[2021-03-06 22:52:59] [DETAIL] server "prod_pgc3_db02" (ID: 2) was successfully promoted to primary
Starting pgbouncer on all traffic routers
starting pgbouncer on 10.98.95.33
starting pgbouncer on 10.98.95.34
[2021-03-06 22:53:00] [NOTICE] 1 followers to notify
[2021-03-06 22:53:00] [NOTICE] notifying node 3 to follow node 2
INFO:  node 3 received notification to follow node 2
[2021-03-06 22:53:00] [INFO] switching to primary monitoring mode
[2021-03-06 22:53:00] [NOTICE] monitoring cluster primary "prod_pgc3_db02" (node ID: 2)

However, db03 could now see its old primary (db01) again and did not begin to follow db02 after the promotion:

[2021-03-06 22:53:00] [NOTICE] redirecting logging output to "/var/log/repmgr/repmgrd.log"

[2021-03-06 22:53:00] [ERROR] connection to database failed:
        could not fork new process for connection: Cannot allocate memory

could not fork new process for connection: Cannot allocate memory

[2021-03-06 22:53:00] [ERROR] unable to establish a replication connection to the primary node
[2021-03-06 22:53:00] [NOTICE] original primary reappeared - no action taken
[2021-03-06 22:53:00] [WARNING] unable to create event record
[2021-03-06 22:53:00] [DETAIL] function requires at least protocol version 3.0

[2021-03-06 22:53:00] [DETAIL]  INSERT INTO repmgr.events (              node_id,              event,              successful,              details             )       VALUES ($1, $2, $3, $4)    RETURNING event_timestamp
[2021-03-06 22:53:00] [NOTICE] 1 followers to notify
[2021-03-06 22:53:00] [NOTICE] notifying node 2 to follow node 1
INFO:  node 2 received notification to follow node 1
[2021-03-06 22:53:00] [INFO] resuming standby monitoring mode
[2021-03-06 22:53:00] [DETAIL] original primary "prod_pgc3_db01" (node ID: 1) reappeared

repmgrd on db03 then proceeded to crash when reset_node_voting_status() failed:

[2021-03-06 22:53:00] [WARNING] unable to create event record
[2021-03-06 22:53:00] [DETAIL] function requires at least protocol version 3.0

[2021-03-06 22:53:00] [DETAIL]  INSERT INTO repmgr.events (              node_id,              event,              successful,              details             )       VALUES ($1, $2, $3, $4)    RETURNING event_timestamp
[2021-03-06 22:53:00] [NOTICE] 1 followers to notify
[2021-03-06 22:53:00] [NOTICE] notifying node 2 to follow node 1
INFO:  node 2 received notification to follow node 1
[2021-03-06 22:53:00] [INFO] resuming standby monitoring mode
[2021-03-06 22:53:00] [DETAIL] original primary "prod_pgc3_db01" (node ID: 1) reappeared
[2021-03-06 22:53:00] [ERROR] get_primary_node_id(): unable to execute query
[2021-03-06 22:53:00] [DETAIL] query text is:
SELECT node_id                   FROM repmgr.nodes     WHERE type = 'primary'    AND active IS TRUE
[2021-03-06 22:53:00] [ERROR] reset_node_voting_status(): local_conn not set
[2021-03-06 22:53:00] [ERROR] unable to retrieve record for upstream node (ID: 1), terminating
[2021-03-06 22:53:00] [INFO] repmgrd terminating...

The cluster was now in the following state:

I think this is a classic 'STONITH' issue where db01 is still around and causing problems for the failover. I think the following fixes could be valid;

Thoughts?