postgres@psql-09:/root$ repmgr standby switchover --siblings-follow --dry-run
NOTICE: checking switchover on node "psql-09" (ID: 9) in --dry-run mode
INFO: SSH connection to host "10.10.10.7" succeeded
INFO: able to execute "repmgr" on remote host "10.10.10.7"
INFO: all sibling nodes are reachable via SSH
INFO: 4 walsenders required, 20 available
INFO: demotion candidate is able to make replication connection to promotion candidate
INFO: archive mode is "off"
INFO: replication lag on this standby is 2 seconds
INFO: 4 replication slots required, 20 available
NOTICE: attempting to pause repmgrd on 5 nodes
NOTICE: local node "psql-09" (ID: 9) would be promoted to primary; current primary "psql-07" (ID: 7) would be demoted to standby
INFO: following shutdown command would be run on node "psql-07":
"sudo /usr/bin/pg_ctlcluster 15 main stop"
INFO: parameter "shutdown_check_timeout" is set to 60 seconds
INFO: prerequisites for executing STANDBY SWITCHOVER are met
However, psql-09 (a more powerful server) was configured with max_worker_processes=64, while psql-07 had only max_worker_processes=32. PostgreSQL refuses to start a standby whose max_worker_processes is lower than the primary's, so when we actually did the switchover, we ended up in a limbo state where none of the replicas could rejoin: they failed to restart because of the mismatch in that parameter:
Aug 21 22:11:47 psql-08 postgres[4082218]: [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
Aug 21 22:11:47 psql-08 postgres[4082221]: [1] LOG: database system was interrupted while in recovery at log time 2023-08-21 21:49:15 UTC
Aug 21 22:11:47 psql-08 postgres[4082221]: [2] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
Aug 21 22:11:48 psql-08 postgres[4082221]: [1] LOG: entering standby mode
Aug 21 22:11:48 psql-08 postgres[4082221]: [1] FATAL: recovery aborted because of insufficient parameter settings
Aug 21 22:11:48 psql-08 postgres[4082221]: [2] DETAIL: max_worker_processes = 32 is a lower setting than on the primary server, where its value was 64.
Aug 21 22:11:48 psql-08 postgres[4082221]: [3] HINT: You can restart the server after making the necessary configuration changes.
Aug 21 22:11:48 psql-08 postgres[4082218]: [1] LOG: startup process (PID 4082221) exited with exit code 1
Aug 21 22:11:48 psql-08 postgres[4082218]: [1] LOG: aborting startup due to startup process failure
Aug 21 22:11:48 psql-08 postgres[4082218]: [1] LOG: database system is shut down
We were carrying out a switchover of our primary using repmgr 5.3.3, and ran a dry run first:
sudo -u postgres repmgr standby switchover --siblings-follow --dry-run
It's unexpected that the dry run did not catch the max_worker_processes mismatch 😬
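Until a check like this exists in repmgr itself, something along these lines could be run by hand before a switchover. This is only a rough sketch, not anything repmgr provides: the helper names `fetch_settings` and `compare_settings` are made up for illustration, and it assumes `psql` can connect to each node's `postgres` database as the current user. The parameter list is the set PostgreSQL documents as needing to be at least as high on a standby as on its primary.

```shell
#!/usr/bin/env bash
# Sketch of a manual pre-switchover check (hypothetical helpers, not part of repmgr).

# Settings that must be >= on a standby compared with its primary.
PARAMS="'max_connections','max_prepared_transactions','max_locks_per_transaction','max_wal_senders','max_worker_processes'"

# fetch_settings HOST
# Prints "name value" lines for HOST. Assumes passwordless psql access
# to the "postgres" database on that host.
fetch_settings() {
  psql -h "$1" -d postgres -At -F ' ' \
    -c "SELECT name, setting FROM pg_settings WHERE name IN ($PARAMS) ORDER BY name"
}

# compare_settings CANDIDATE_FILE STANDBY_FILE
# Prints one line per parameter whose standby value is lower than the
# promotion candidate's value. Returns 1 if any such parameter is found.
compare_settings() {
  awk '
    NR == FNR { want[$1] = $2; next }   # first file: promotion candidate
    ($1 in want) && ($2 + 0 < want[$1] + 0) {
      printf "%s: standby has %s, candidate has %s\n", $1, $2, want[$1]
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  ' "$1" "$2"
}

# Example usage (host names taken from the logs above):
#   fetch_settings psql-09 > candidate.txt
#   fetch_settings psql-08 > standby.txt
#   compare_settings candidate.txt standby.txt || echo "fix settings before switchover"
```

In our case this would have flagged max_worker_processes on every node still at 32, including psql-07, which has to come back as a standby of psql-09 after the switchover.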