Closed nucfisher closed 6 years ago
The same issue occurs on REL_4_0_STABLE (4.0.1).
I'm seeing the exact same problem with 4.0.1 (4.0.1-1.pgdg14.04+1 0 from apt.postgresql.org). Running a switchover promotes the current standby to primary but kills the old primary instead of demoting it to standby.
Host names redacted in the output below.
BEFORE:
ID | Name | Role | Status | Upstream | Location | Connection string
----+--------------------------+---------+-----------+--------------------------+----------+---------------------------------------
1 | xxxxxxxxxxxxxxxxxxxx1 | standby | running | xxxxxxxxxxxxxxxxxxxx2 | default | host=xxx1 port=5432 user=replicator dbname=repmgr sslmode=require connect_timeout=5
2 | xxxxxxxxxxxxxxxxxxxx2 | primary | * running | | default | host=xxx2 port=5432 user=replicator dbname=repmgr sslmode=require connect_timeout=5
3 | xxxxxxxxxxxxxxxxxxxx3 | standby | running | xxxxxxxxxxxxxxxxxxxx2 | default | host=xxx3 port=5432 user=replicator dbname=repmgr sslmode=require connect_timeout=5
AFTER:
ID | Name | Role | Status | Upstream | Location | Connection string
----+--------------------------+---------+-----------+--------------------------+----------+-----------------------------------------------------------------
1 | xxxxxxxxxxxxxxxxxxxx1 | primary | * running | | default | host=xxx1 port=5432 user=replicator dbname=repmgr sslmode=require connect_timeout=5
2 | xxxxxxxxxxxxxxxxxxxx2 | primary | - failed | | default | host=xxx2 port=5432 user=replicator dbname=repmgr sslmode=require connect_timeout=5
3 | xxxxxxxxxxxxxxxxxxxx3 | standby | running | xxxxxxxxxxxxxxxxxxxx1 | default | host=xxx3 port=5432 user=replicator dbname=repmgr sslmode=require connect_timeout=5
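The symptom in the AFTER output (a node still registered as primary but marked failed) can be checked for mechanically. Below is a minimal sketch, assuming the pipe-separated, seven-column layout of the `repmgr cluster show` output pasted above; the function names `parse_cluster_show` and `failed_primaries` are hypothetical helpers for illustration, not part of repmgr.

```python
def parse_cluster_show(text):
    """Parse pipe-separated `repmgr cluster show` rows into dicts.

    Assumes the seven-column layout shown in this thread; header and
    separator lines are skipped because their first cell is not numeric.
    """
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.split("|")]
        if len(cells) != 7 or not cells[0].isdigit():
            continue
        node_id, name, role, status, upstream, location, conninfo = cells
        rows.append({
            "id": int(node_id),
            "name": name,
            "role": role,
            # Drop the leading "*" / "-" marker in front of the status word.
            "status": status.lstrip("*-! ").strip(),
            "upstream": upstream,
            "location": location,
            "conninfo": conninfo,
        })
    return rows


def failed_primaries(rows):
    """Names of nodes registered as primary whose status is not 'running'."""
    return [r["name"] for r in rows
            if r["role"] == "primary" and r["status"] != "running"]


# Shortened sample in the same layout as the AFTER output (names abbreviated).
SAMPLE = """\
ID | Name | Role | Status | Upstream | Location | Connection string
----+------+---------+-----------+----------+----------+------------------
1 | node1 | primary | * running | | default | host=xxx1 port=5432
2 | node2 | primary | - failed | | default | host=xxx2 port=5432
3 | node3 | standby | running | node1 | default | host=xxx3 port=5432
"""

if __name__ == "__main__":
    print(failed_primaries(parse_cluster_show(SAMPLE)))
```

A non-empty result after a switchover would indicate the state reported here, where the old primary was not demoted.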
Our fail-over testing procedure requires this to work in order to fail back after a fail-over. It's another of several disappointing bugs in 4.x, which has been forced upon us via the APT repo due to the removal of 3.x :-(
Once this has happened, my recovery procedure is to re-clone the broken primary as a standby and re-register it, and then restart repmgrd on all nodes, as it otherwise continues to believe that the old primary is still the primary (I'll raise a separate bug for this).
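The recovery procedure described above can be sketched as an ordered command list. This is only an illustration under stated assumptions: the `repmgr.conf` path, the way PostgreSQL is started, and the way repmgrd is restarted are all site-specific and are passed in by the caller; `recovery_plan` is a hypothetical helper that builds the commands and does not run anything.

```python
import shlex


def recovery_plan(new_primary_host, start_postgres_cmd, restart_repmgrd_cmd,
                  conf="/etc/repmgr.conf", user="replicator", dbname="repmgr"):
    """Build (not run) the commands for re-adding a failed old primary.

    conf is an assumed path; user and dbname mirror the connection strings
    in this thread. The two *_cmd arguments are whatever the site uses.
    """
    repmgr = ["repmgr", "-f", conf]
    return [
        # 1. On the failed node: overwrite its data directory with a fresh
        #    clone taken from the new primary.
        repmgr + ["-h", new_primary_host, "-U", user, "-d", dbname,
                  "standby", "clone", "--force"],
        # 2. Start PostgreSQL on the re-cloned node (site-specific command).
        list(start_postgres_cmd),
        # 3. Overwrite the stale "primary" record with a standby registration.
        repmgr + ["standby", "register", "--force"],
        # 4. Restart repmgrd; per the report this is needed on every node.
        list(restart_repmgrd_cmd),
    ]


if __name__ == "__main__":
    # The service commands below are placeholders, not taken from the report.
    plan = recovery_plan(
        "xxx1",
        start_postgres_cmd=["service", "postgresql", "start"],
        restart_repmgrd_cmd=["service", "repmgrd", "restart"],
    )
    for cmd in plan:
        print(shlex.join(cmd))
```

Printing the plan first, rather than executing it, makes it easy to review the destructive `standby clone --force` step before running it by hand on the broken node.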
Of the issues assigned to 4.0.2, I would say that #354, #349 and #343 are P2 and #346 is P3. Looking forward to...
Issue now fixed; 4.0.2 release is scheduled for later next week.
I've upgraded repmgr from 3.4dev to 4.1dev (as of 2017-11-29) and 'standby switchover' doesn't work anymore.
--dry-run looks OK, but the real switchover stops PostgreSQL on the primary node, waits (counting 1, 2, 3, 4, 5, 6), then fails.
My configuration:
Astra Linux 1.5 Special Edition (based on Debian 7.8)
PostgreSQL 9.4.5 (or 9.4.10)
pg_rewind additionally installed
postgresql.conf replication parameters:
repmgr.conf:
repmgrd is disabled
Let's go:
repmgr cluster show:
switchover --dry-run:
real switchover attempt:
after the switchover attempt, PostgreSQL is down on the primary node: