EnterpriseDB / repmgr

A lightweight replication manager for PostgreSQL (Postgres)
https://repmgr.org/

switchover fails on repmgr 4.1dev #349

Closed nucfisher closed 6 years ago

nucfisher commented 6 years ago

I've upgraded repmgr from 3.4dev to 4.1dev (as of 2017-11-29) and 'standby switchover' doesn't work anymore.

--dry-run looks OK, but a real switchover stops PostgreSQL on the primary node, waits through primary status checks 1, 2, 3, 4, 5, 6, and then fails.

My configuration: Astra Linux 1.5 Special Edition (based on Debian 7.8), PostgreSQL 9.4.5 (or 9.4.10), with pg_rewind additionally installed.

postgresql.conf replication parameters:

    hot_standby = on
    wal_level = 'hot_standby'
    max_wal_senders = 5
    wal_keep_segments = 20
    wal_log_hints = on
    max_replication_slots = 2
    archive_mode = on
    archive_command = 'cd .'
    checkpoint_segments = 8
    shared_preload_libraries = 'repmgr, pg_rewind_support'

repmgr.conf:

node_id=2
node_name=astra2
conninfo='host=astra2 dbname=repmgr_db user=repmgr_usr'
data_directory=/var/lib/postgresql/9.4/main

use_replication_slots=1
log_level=DEBUG
log_facility=STDERR
log_file='/var/log/repmgr/repmgr.log'

pg_bindir=/usr/bin/
#pg_bindir=/usr/lib/postgresql/9.4/bin/
pg_ctl_options='-s -l /var/log/repmgr/repmgr-pg_ctl.log'
reconnect_attempts=6
reconnect_interval=10
failover=manual  # one of 'automatic', 'manual'
priority=100        # a value of zero or less prevents the node being promoted to master
promote_command='repmgr standby promote -f /etc/repmgr/repmgr.conf'
follow_command='repmgr standby follow -f /etc/repmgr/repmgr.conf -W'
primary_notification_timeout=1800

repmgrd is disabled
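
For reference, the "6 attempts" reported in the switchover log matches reconnect_attempts=6 in the repmgr.conf above. The sketch below reads those two settings and multiplies them to get a rough wait budget. It is a simplified reader written for this example, not repmgr's own parser, and it assumes one pause of reconnect_interval seconds per attempt, which the log does not show.

```python
def parse_repmgr_conf(text):
    """Parse simple key=value lines (simplified; not repmgr's own parser)."""
    settings = {}
    for raw in text.splitlines():
        # Drop comments. This would mangle a quoted value containing '#',
        # which none of the settings in this report have.
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip("'")
    return settings


# Lines copied from the repmgr.conf in this report.
SAMPLE = """\
reconnect_attempts=6
reconnect_interval=10
failover=manual  # one of 'automatic', 'manual'
#pg_bindir=/usr/lib/postgresql/9.4/bin/
"""

conf = parse_repmgr_conf(SAMPLE)
attempts = int(conf["reconnect_attempts"])
interval = int(conf["reconnect_interval"])

# Assumption: one pause of `interval` seconds per attempt.
max_wait_seconds = attempts * interval
print(f"{attempts} attempts x {interval}s = {max_wait_seconds}s")
```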

Let's go:

repmgr cluster show:

root@astra153:/etc/repmgr/RR# ./cluster_show.sh
NOTICE: using provided configuration file "/etc/repmgr/repmgr.conf"
 ID | Name   | Role    | Status    | Upstream | Location | Connection string                           
----+--------+---------+-----------+----------+----------+----------------------------------------------
 2  | astra2 | primary | * running |          | default  | host=astra2 dbname=repmgr_db user=repmgr_usr
 3  | astra3 | standby |   running | astra2   | default  | host=astra3 dbname=repmgr_db user=repmgr_usr

switchover --dry-run:

NOTICE: using provided configuration file "/etc/repmgr/repmgr.conf"
DEBUG: connecting to: "user=repmgr_usr dbname=repmgr_db host=astra3 connect_timeout=2 fallback_application_name=repmgr"
DEBUG: set_config():
  SET synchronous_commit TO 'local'
DEBUG: get_node_record():
  SELECT node_id, type, upstream_node_id, node_name, conninfo, repluser, slot_name, location, priority, active, config_file, '' AS upstream_node_name   FROM repmgr.nodes  WHERE node_id = 3
NOTICE: checking switchover on node "astra3" (ID: 3) in --dry-run mode
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
INFO: searching for primary node
DEBUG: get_primary_connection():
  SELECT node_id, conninfo,          CASE WHEN type = 'primary' THEN 1 ELSE 2 END AS type_priority         FROM repmgr.nodes    WHERE active IS TRUE      AND type != 'witness' ORDER BY active DESC, type_priority, priority, node_id
INFO: checking if node 2 is primary
DEBUG: connecting to: "user=repmgr_usr dbname=repmgr_db host=astra2 connect_timeout=2 fallback_application_name=repmgr"
DEBUG: set_config():
  SET synchronous_commit TO 'local'
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
INFO: current primary node is 2
DEBUG: get_node_record():
  SELECT node_id, type, upstream_node_id, node_name, conninfo, repluser, slot_name, location, priority, active, config_file, '' AS upstream_node_name   FROM repmgr.nodes  WHERE node_id = 2
DEBUG: remote node name is "astra2"
DEBUG: test_ssh_connection(): executing ssh -o Batchmode=yes -q -o ConnectTimeout=10 astra2 /bin/true 2>/dev/null
INFO: SSH connection to host "astra2" succeeded
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 astra2 /usr/bin/repmgr -f /etc/repmgr/repmgr.conf --version 2>/dev/null && echo "1" || echo "0"
DEBUG: remote_command(): output returned was:
  repmgr 4.1dev
1

INFO: able to execute "repmgr" on remote host "localhost"
DEBUG: guc_set():
SELECT true FROM pg_catalog.pg_settings  WHERE name = 'archive_mode' AND setting != 'off'
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 astra2 /usr/bin/repmgr -f /etc/repmgr/repmgr.conf node check --terse -LERROR --archive-ready --optformat
DEBUG: remote_command(): output returned was:
  --status=OK --files=0

INFO: 0 pending archive files
DEBUG: get_replication_lag_seconds():
 SELECT CASE WHEN (pg_catalog.pg_last_xlog_receive_location() = pg_catalog.pg_last_xlog_replay_location())           THEN 0         ELSE EXTRACT(epoch FROM (pg_catalog.clock_timestamp() - pg_catalog.pg_last_xact_replay_timestamp()))::INT           END         AS lag_seconds
DEBUG: lag is 0 
INFO: replication lag on this standby is 0 seconds
DEBUG: get_active_sibling_node_records():
  SELECT node_id, type, upstream_node_id, node_name, conninfo, repluser, slot_name, location, priority, active, config_file, '' AS upstream_node_name     FROM repmgr.nodes    WHERE upstream_node_id = 2      AND node_id != 3      AND active IS TRUE ORDER BY node_id 
DEBUG: clear_node_info_list() - closing open connections
DEBUG: clear_node_info_list() - unlinking
NOTICE: local node "astra3" (ID: 3) will be promoted to primary; current primary "astra2" (ID: 2) will be demoted to standby
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 astra2 /usr/bin/repmgr -f /etc/repmgr/repmgr.conf node service --terse -LERROR --list-actions --action=stop
DEBUG: remote_command(): output returned was:
  /usr/bin/pg_ctl -s -D '/var/lib/postgresql/9.4/main' -m fast -W stop

INFO: following shutdown command would be run on node "astra2":
  "/usr/bin/pg_ctl -s -D '/var/lib/postgresql/9.4/main' -m fast -W stop"
DEBUG: clear_node_info_list() - closing open connections
DEBUG: clear_node_info_list() - unlinking

real switchover attempt

root@astra153:/etc/repmgr/RR# ./standby_switchover.sh
**==== CALL cluster_show ====**
NOTICE: using provided configuration file "/etc/repmgr/repmgr.conf"
 ID | Name   | Role    | Status    | Upstream | Location | Connection string                           
----+--------+---------+-----------+----------+----------+----------------------------------------------
 2  | astra2 | primary | * running |          | default  | host=astra2 dbname=repmgr_db user=repmgr_usr
 3  | astra3 | standby |   running | astra2   | default  | host=astra3 dbname=repmgr_db user=repmgr_usr
**==== CALL switchover ====**
NOTICE: using provided configuration file "/etc/repmgr/repmgr.conf"
DEBUG: connecting to: "user=repmgr_usr dbname=repmgr_db host=astra3 connect_timeout=2 fallback_application_name=repmgr"
DEBUG: set_config():
  SET synchronous_commit TO 'local'
DEBUG: get_node_record():
  SELECT node_id, type, upstream_node_id, node_name, conninfo, repluser, slot_name, location, priority, active, config_file, '' AS upstream_node_name   FROM repmgr.nodes  WHERE node_id = 3
NOTICE: executing switchover on node "astra3" (ID: 3)
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
INFO: searching for primary node
DEBUG: get_primary_connection():
  SELECT node_id, conninfo,          CASE WHEN type = 'primary' THEN 1 ELSE 2 END AS type_priority         FROM repmgr.nodes    WHERE active IS TRUE      AND type != 'witness' ORDER BY active DESC, type_priority, priority, node_id
INFO: checking if node 2 is primary
DEBUG: connecting to: "user=repmgr_usr dbname=repmgr_db host=astra2 connect_timeout=2 fallback_application_name=repmgr"
DEBUG: set_config():
  SET synchronous_commit TO 'local'
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
INFO: current primary node is 2
DEBUG: get_node_record():
  SELECT node_id, type, upstream_node_id, node_name, conninfo, repluser, slot_name, location, priority, active, config_file, '' AS upstream_node_name   FROM repmgr.nodes  WHERE node_id = 2
DEBUG: remote node name is "astra2"
DEBUG: test_ssh_connection(): executing ssh -o Batchmode=yes -q -o ConnectTimeout=10 astra2 /bin/true 2>/dev/null
INFO: SSH connection to host "astra2" succeeded
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 astra2 /usr/bin/repmgr -f /etc/repmgr/repmgr.conf --version 2>/dev/null && echo "1" || echo "0"
DEBUG: remote_command(): output returned was:
  repmgr 4.1dev
1

DEBUG: guc_set():
SELECT true FROM pg_catalog.pg_settings  WHERE name = 'archive_mode' AND setting != 'off'
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 astra2 /usr/bin/repmgr -f /etc/repmgr/repmgr.conf node check --terse -LERROR --archive-ready --optformat
DEBUG: remote_command(): output returned was:
  --status=OK --files=0

INFO: 0 pending archive files
DEBUG: get_replication_lag_seconds():
 SELECT CASE WHEN (pg_catalog.pg_last_xlog_receive_location() = pg_catalog.pg_last_xlog_replay_location())           THEN 0         ELSE EXTRACT(epoch FROM (pg_catalog.clock_timestamp() - pg_catalog.pg_last_xact_replay_timestamp()))::INT           END         AS lag_seconds
DEBUG: lag is 0 
INFO: replication lag on this standby is 0 seconds
DEBUG: get_active_sibling_node_records():
  SELECT node_id, type, upstream_node_id, node_name, conninfo, repluser, slot_name, location, priority, active, config_file, '' AS upstream_node_name     FROM repmgr.nodes    WHERE upstream_node_id = 2      AND node_id != 3      AND active IS TRUE ORDER BY node_id 
DEBUG: clear_node_info_list() - closing open connections
DEBUG: clear_node_info_list() - unlinking
NOTICE: local node "astra3" (ID: 3) will be promoted to primary; current primary "astra2" (ID: 2) will be demoted to standby
NOTICE: stopping current primary node "astra2" (ID: 2)
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 astra2 /usr/bin/repmgr -f /etc/repmgr/repmgr.conf node service --action=stop --checkpoint
DEBUG: connecting to: "user=repmgr_usr dbname=repmgr_db host=astra2 connect_timeout=2 fallback_application_name=repmgr"
NOTICE: issuing CHECKPOINT
DETAIL: executing server command "/usr/bin/pg_ctl -s -D '/var/lib/postgresql/9.4/main' -m fast -W stop"
DEBUG: remote_command(): output returned was:

INFO: checking primary status; 1 of 6 attempts
INFO: checking primary status; 2 of 6 attempts
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 astra2 /usr/bin/repmgr -f /etc/repmgr/repmgr.conf node status --is-shutdown-cleanly
DEBUG: remote_command(): output returned was:
  --state=UNCLEAN_SHUTDOWN

INFO: checking primary status; 3 of 6 attempts
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 astra2 /usr/bin/repmgr -f /etc/repmgr/repmgr.conf node status --is-shutdown-cleanly
DEBUG: remote_command(): output returned was:
  --state=UNCLEAN_SHUTDOWN

INFO: checking primary status; 4 of 6 attempts
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 astra2 /usr/bin/repmgr -f /etc/repmgr/repmgr.conf node status --is-shutdown-cleanly
DEBUG: remote_command(): output returned was:
  --state=UNCLEAN_SHUTDOWN

INFO: checking primary status; 5 of 6 attempts
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 astra2 /usr/bin/repmgr -f /etc/repmgr/repmgr.conf node status --is-shutdown-cleanly
DEBUG: remote_command(): output returned was:
  --state=UNCLEAN_SHUTDOWN

*** glibc detected *** /usr/lib/postgresql/9.4/bin/repmgr: free(): invalid pointer: 0x000055a2f45ee2f0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x7dae6)[0x7f1076c9aae6]
/usr/lib/postgresql/9.4/bin/repmgr(+0x1381d)[0x55a2f247481d]
/usr/lib/postgresql/9.4/bin/repmgr(main+0x207a)[0x55a2f2466e2a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f1076c3e925]
/usr/lib/postgresql/9.4/bin/repmgr(+0x6965)[0x55a2f2467965]
======= Memory map: ========
55a2f2461000-55a2f24b3000 r-xp 00000000 08:01 165343                     /usr/lib/postgresql/9.4/bin/repmgr
55a2f26b2000-55a2f26b3000 r--p 00051000 08:01 165343                     /usr/lib/postgresql/9.4/bin/repmgr
55a2f26b3000-55a2f26c1000 rw-p 00052000 08:01 165343                     /usr/lib/postgresql/9.4/bin/repmgr
55a2f26c1000-55a2f26c8000 rw-p 00000000 00:00 0 
55a2f45e5000-55a2f464c000 rw-p 00000000 00:00 0                          [heap]
7f10737d7000-7f10737ec000 r-xp 00000000 08:01 522244                     /lib/x86_64-linux-gnu/libgcc_s.so.1
7f10737ec000-7f10739ec000 ---p 00015000 08:01 522244                     /lib/x86_64-linux-gnu/libgcc_s.so.1
7f10739ec000-7f10739ed000 rw-p 00015000 08:01 522244                     /lib/x86_64-linux-gnu/libgcc_s.so.1
7f10739ed000-7f10739f0000 r-xp 00000000 08:01 522578                     /lib/x86_64-linux-gnu/libgpg-error.so.0.8.0
7f10739f0000-7f1073bef000 ---p 00003000 08:01 522578                     /lib/x86_64-linux-gnu/libgpg-error.so.0.8.0
7f1073bef000-7f1073bf0000 rw-p 00002000 08:01 522578                     /lib/x86_64-linux-gnu/libgpg-error.so.0.8.0
7f1073bf0000-7f1073bf9000 r-xp 00000000 08:01 522267                     /lib/x86_64-linux-gnu/libcrypt-2.15.so
7f1073bf9000-7f1073df8000 ---p 00009000 08:01 522267                     /lib/x86_64-linux-gnu/libcrypt-2.15.so
7f1073df8000-7f1073df9000 r--p 00008000 08:01 522267                     /lib/x86_64-linux-gnu/libcrypt-2.15.so
7f1073df9000-7f1073dfa000 rw-p 00009000 08:01 522267                     /lib/x86_64-linux-gnu/libcrypt-2.15.so
7f1073dfa000-7f1073e28000 rw-p 00000000 00:00 0 
7f1073e28000-7f1073e39000 r-xp 00000000 08:01 136651                     /usr/lib/x86_64-linux-gnu/libp11-kit.so.0.0.0
7f1073e39000-7f1074038000 ---p 00011000 08:01 136651                     /usr/lib/x86_64-linux-gnu/libp11-kit.so.0.0.0
7f1074038000-7f1074039000 r--p 00010000 08:01 136651                     /usr/lib/x86_64-linux-gnu/libp11-kit.so.0.0.0
7f1074039000-7f107403a000 rw-p 00011000 08:01 136651                     /usr/lib/x86_64-linux-gnu/libp11-kit.so.0.0.0
7f107403a000-7f1074051000 r-xp 00000000 08:01 522374                     /lib/x86_64-linux-gnu/libz.so.1.2.7
7f1074051000-7f1074250000 ---p 00017000 08:01 522374                     /lib/x86_64-linux-gnu/libz.so.1.2.7
7f1074250000-7f1074251000 r--p 00016000 08:01 522374                     /lib/x86_64-linux-gnu/libz.so.1.2.7
7f1074251000-7f1074252000 rw-p 00017000 08:01 522374                     /lib/x86_64-linux-gnu/libz.so.1.2.7
7f1074252000-7f10742cd000 r-xp 00000000 08:01 522565                     /lib/x86_64-linux-gnu/libgcrypt.so.11.7.0
7f10742cd000-7f10744cd000 ---p 0007b000 08:01 522565                     /lib/x86_64-linux-gnu/libgcrypt.so.11.7.0
7f10744cd000-7f10744ce000 r--p 0007b000 08:01 522565                     /lib/x86_64-linux-gnu/libgcrypt.so.11.7.0
7f10744ce000-7f10744d1000 rw-p 0007c000 08:01 522565                     /lib/x86_64-linux-gnu/libgcrypt.so.11.7.0
7f10744d1000-7f10744e1000 r-xp 00000000 08:01 132068                     /usr/lib/x86_64-linux-gnu/libtasn1.so.3.1.16
7f10744e1000-7f10746e0000 ---p 00010000 08:01 132068                     /usr/lib/x86_64-linux-gnu/libtasn1.so.3.1.16
7f10746e0000-7f10746e1000 r--p 0000f000 08:01 132068                     /usr/lib/x86_64-linux-gnu/libtasn1.so.3.1.16
7f10746e1000-7f10746e2000 rw-p 00010000 08:01 132068                     /usr/lib/x86_64-linux-gnu/libtasn1.so.3.1.16
7f10746e2000-7f10746ef000 r-xp 00000000 08:01 522370                     /lib/libgost.so.2.0.2
7f10746ef000-7f10748ee000 ---p 0000d000 08:01 522370                     /lib/libgost.so.2.0.2
7f10748ee000-7f10748ef000 rw-p 0000c000 08:01 522370                     /lib/libgost.so.2.0.2
7f10748ef000-7f10749a8000 r-xp 00000000 08:01 136665                     /usr/lib/x86_64-linux-gnu/libgnutls.so.26.22.4
7f10749a8000-7f1074ba7000 ---p 000b9000 08:01 136665                     /usr/lib/x86_64-linux-gnu/libgnutls.so.26.22.4
7f1074ba7000-7f1074bad000 r--p 000b8000 08:01 136665                     /usr/lib/x86_64-linux-gnu/libgnutls.so.26.22.4
7f1074bad000-7f1074baf000 rw-p 000be000 08:01 136665                     /usr/lib/x86_64-linux-gnu/libgnutls.so.26.22.4
7f1074baf000-7f1074bc9000 r-xp 00000000 08:01 140088                     /usr/lib/x86_64-linux-gnu/libsasl2.so.2.0.25
7f1074bc9000-7f1074dc8000 ---p 0001a000 08:01 140088                     /usr/lib/x86_64-linux-gnu/libsasl2.so.2.0.25
7f1074dc8000-7f1074dc9000 r--p 00019000 08:01 140088                     /usr/lib/x86_64-linux-gnu/libsasl2.so.2.0.25
7f1074dc9000-7f1074dca000 rw-p 0001a000 08:01 140088                     /usr/lib/x86_64-linux-gnu/libsasl2.so.2.0.25
7f1074dca000-7f1074dd8000 r-xp 00000000 08:01 133162                     /usr/lib/x86_64-linux-gnu/liblber-2.4.so.2.10.3
7f1074dd8000-7f1074fd7000 ---p 0000e000 08:01 133162                     /usr/lib/x86_64-linux-gnu/liblber-2.4.so.2.10.3
7f1074fd7000-7f1074fd8000 r--p 0000d000 08:01 133162                     /usr/lib/x86_64-linux-gnu/liblber-2.4.so.2.10.3
7f1074fd8000-7f1074fd9000 rw-p 0000e000 08:01 133162                     /usr/lib/x86_64-linux-gnu/liblber-2.4.so.2.10.3
7f1074fd9000-7f1074fef000 r-xp 00000000 08:01 522256                     /lib/x86_64-linux-gnu/libresolv-2.15.so
7f1074fef000-7f10751ef000 ---p 00016000 08:01 522256                     /lib/x86_64-linux-gnu/libresolv-2.15.so
7f10751ef000-7f10751f0000 r--p 00016000 08:01 522256                     /lib/x86_64-linux-gnu/libresolv-2.15.so
7f10751f0000-7f10751f1000 rw-p 00017000 08:01 522256                     /lib/x86_64-linux-gnu/libresolv-2.15.so
7f10751f1000-7f10751f3000 rw-p 00000000 00:00 0 
7f10751f3000-7f10751f6000 r-xp 00000000 08:01 523938                     /lib/x86_64-linux-gnu/libkeyutils.so.1.4
7f10751f6000-7f10753f5000 ---p 00003000 08:01 523938                     /lib/x86_64-linux-gnu/libkeyutils.so.1.4
7f10753f5000-7f10753f6000 r--p 00002000 08:01 523938                     /lib/x86_64-linux-gnu/libkeyutils.so.1.4
7f10753f6000-7f10753f7000 rw-p 00003000 08:01 523938                     /lib/x86_64-linux-gnu/libkeyutils.so.1.4
7f10753f7000-7f1075402000 r-xp 00000000 08:01 133109                     /usr/lib/x86_64-linux-gnu/libkrb5support.so.0.1
7f1075402000-7f1075601000 ---p 0000b000 08:01 133109                     /usr/lib/x86_64-linux-gnu/libkrb5support.so.0.1
7f1075601000-7f1075602000 r--p 0000a000 08:01 133109                     /usr/lib/x86_64-linux-gnu/libkrb5support.so.0.1
7f1075602000-7f1075603000 rw-p 0000b000 08:01 133109                     /usr/lib/x86_64-linux-gnu/libkrb5support.so.0.1
7f1075603000-7f1075606000 r-xp 00000000 08:01 522313                     /lib/x86_64-linux-gnu/libcom_err.so.2.1
7f1075606000-7f1075805000 ---p 00003000 08:01 522313                     /lib/x86_64-linux-gnu/libcom_err.so.2.1
7f1075805000-7f1075806000 r--p 00002000 08:01 522313                     /lib/x86_64-linux-gnu/libcom_err.so.2.1
7f1075806000-7f1075807000 rw-p 00003000 08:01 522313                     /lib/x86_64-linux-gnu/libcom_err.so.2.1
7f1075807000-7f1075835000 r-xp 00000000 08:01 133085                     /usr/lib/x86_64-linux-gnu/libk5crypto.so.3.1
7f1075835000-7f1075a34000 ---p 0002e000 08:01 133085                     /usr/lib/x86_64-linux-gnu/libk5crypto.so.3.1
7f1075a34000-7f1075a36000 r--p 0002d000 08:01 133085                     /usr/lib/x86_64-linux-gnu/libk5crypto.so.3.1
7f1075a36000-7f1075a37000 rw-p 0002f000 08:01 133085                     /usr/lib/x86_64-linux-gnu/libk5crypto.so.3.1
7f1075a37000-7f1075a38000 rw-p 00000000 00:00 0 
7f1075a38000-7f1075af9000 r-xp 00000000 08:01 133104                     /usr/lib/x86_64-linux-gnu/libkrb5.so.3.3
7f1075af9000-7f1075cf8000 ---p 000c1000 08:01 133104                     /usr/lib/x86_64-linux-gnu/libkrb5.so.3.3
7f1075cf8000-7f1075d06000 r--p 000c0000 08:01 133104                     /usr/lib/x86_64-linux-gnu/libkrb5.so.3.3
7f1075d06000-7f1075d09000 rw-p 000ce000 08:01 133104                     /usr/lib/x86_64-linux-gnu/libkrb5.so.3.3
7f1075d09000-7f1075d21000 r-xp 00000000 08:01 522255                     /lib/x86_64-linux-gnu/libpthread-2.15.so
7f1075d21000-7f1075f20000 ---p 00018000 08:01 522255                     /lib/x86_64-linux-gnu/libpthread-2.15.so
7f1075f20000-7f1075f21000 r--p 00017000 08:01 522255                     /lib/x86_64-linux-gnu/libpthread-2.15.so
7f1075f21000-7f1075f22000 rw-p 00018000 08:01 522255                     /lib/x86_64-linux-gnu/libpthread-2.15.so
7f1075f22000-7f1075f26000 rw-p 00000000 00:00 0 
7f1075f26000-7f1075f72000 r-xp 00000000 08:01 133161                     /usr/lib/x86_64-linux-gnu/libldap_r-2.4.so.2.10.3
7f1075f72000-7f1076172000 ---p 0004c000 08:01 133161                     /usr/lib/x86_64-linux-gnu/libldap_r-2.4.so.2.10.3
7f1076172000-7f1076174000 r--p 0004c000 08:01 133161                     /usr/lib/x86_64-linux-gnu/libldap_r-2.4.so.2.10.3
7f1076174000-7f1076175000 rw-p 0004e000 08:01 133161                     /usr/lib/x86_64-linux-gnu/libldap_r-2.4.so.2.10.3
7f1076175000-7f1076177000 rw-p 00000000 00:00 0 
7f1076177000-7f107617a000 r-xp 00000000 08:01 522260                     /lib/x86_64-linux-gnu/libdl-2.15.so
7f107617a000-7f1076379000 ---p 00003000 08:01 522260                     /lib/x86_64-linux-gnu/libdl-2.15.so
7f1076379000-7f107637a000 r--p 00002000 08:01 522260                     /lib/x86_64-linux-gnu/libdl-2.15.so
7f107637a000-7f107637b000 rw-p 00003000 08:01 522260                     /lib/x86_64-linux-gnu/libdl-2.15.so
7f107637b000-7f10763c1000 r-xp 00000000 08:01 133095                     /usr/lib/x86_64-linux-gnu/libgssapi_krb5.so.2.2
7f10763c1000-7f10765c1000 ---p 00046000 08:01 133095                     /usr/lib/x86_64-linux-gnu/libgssapi_krb5.so.2.2
7f10765c1000-7f10765c2000 r--p 00046000 08:01 133095                     /usr/lib/x86_64-linux-gnu/libgssapi_krb5.so.2.2
7f10765c2000-7f10765c4000 rw-p 00047000 08:01 133095                     /usr/lib/x86_64-linux-gnu/libgssapi_krb5.so.2.2
7f10765c4000-7f107678f000 r-xp 00000000 08:01 133005                     /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
7f107678f000-7f107698f000 ---p 001cb000 08:01 133005                     /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
7f107698f000-7f10769aa000 r--p 001cb000 08:01 133005                     /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
7f10769aa000-7f10769b9000 rw-p 001e6000 08:01 133005                     /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
7f10769b9000-7f10769bd000 rw-p 00000000 00:00 0 
7f10769bd000-7f1076a14000 r-xp 00000000 08:01 133006                     /usr/lib/x86_64-linux-gnu/libssl.so.1.0.0
7f1076a14000-7f1076c14000 ---p 00057000 08:01 133006                     /usr/lib/x86_64-linux-gnu/libssl.so.1.0.0
7f1076c14000-7f1076c17000 r--p 00057000 08:01 133006                     /usr/lib/x86_64-linux-gnu/libssl.so.1.0.0
7f1076c17000-7f1076c1d000 rw-p 0005a000 08:01 133006                     /usr/lib/x86_64-linux-gnu/libssl.so.1.0.0
7f1076c1d000-7f1076dd4000 r-xp 00000000 08:01 522258                     /lib/x86_64-linux-gnu/libc-2.15.so
7f1076dd4000-7f1076fd3000 ---p 001b7000 08:01 522258                     /lib/x86_64-linux-gnu/libc-2.15.so
7f1076fd3000-7f1076fd7000 r--p 001b6000 08:01 522258                     /lib/x86_64-linux-gnu/libc-2.15.so
7f1076fd7000-7f1076fd9000 rw-p 001ba000 08:01 522258                     /lib/x86_64-linux-gnu/libc-2.15.so
7f1076fd9000-7f1076fde000 rw-p 00000000 00:00 0 
7f1076fde000-7f107700a000 r-xp 00000000 08:01 133816                     /usr/lib/x86_64-linux-gnu/libpq.so.5.7
7f107700a000-7f107720a000 ---p 0002c000 08:01 133816                     /usr/lib/x86_64-linux-gnu/libpq.so.5.7
7f107720a000-7f107720d000 r--p 0002c000 08:01 133816                     /usr/lib/x86_64-linux-gnu/libpq.so.5.7
7f107720d000-7f107720e000 rw-p 0002f000 08:01 133816                     /usr/lib/x86_64-linux-gnu/libpq.so.5.7
7f107720e000-7f1077231000 r-xp 00000000 08:01 522270                     /lib/x86_64-linux-gnu/ld-2.15.so
7f10773a0000-7f10773d5000 r--s 00000000 08:01 688117                     /var/cache/nscd/hosts
7f10773d5000-7f107740a000 r--s 00000000 08:01 688111                     /var/cache/nscd/passwd
7f107740a000-7f1077416000 rw-p 00000000 00:00 0 
7f107742d000-7f1077430000 rw-p 00000000 00:00 0 
7f1077430000-7f1077431000 r--p 00022000 08:01 522270                     /lib/x86_64-linux-gnu/ld-2.15.so
7f1077431000-7f1077433000 rw-p 00023000 08:01 522270                     /lib/x86_64-linux-gnu/ld-2.15.so
7fff54ecf000-7fff54ef0000 rw-p 00000000 00:00 0                          [stack]
7fff54f73000-7fff54f75000 r--p 00000000 00:00 0                          [vvar]
7fff54f75000-7fff54f77000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r--p 00000000 00:00 0                  [vsyscall]
**==== CALL cluster_show ====**
NOTICE: using provided configuration file "/etc/repmgr/repmgr.conf"
ERROR: connection to database failed:
  could not connect to server: Connection refused
        Is the server running on host "astra2" (192.168.10.14) and accepting
        TCP/IP connections on port 5432?

DETAIL: attempted to connect using:
  user=repmgr_usr dbname=repmgr_db host=astra2 connect_timeout=2 fallback_application_name=repmgr
 ID | Name   | Role    | Status        | Upstream | Location | Connection string                           
----+--------+---------+---------------+----------+----------+----------------------------------------------
 2  | astra2 | primary | ? unreachable |          | default  | host=astra2 dbname=repmgr_db user=repmgr_usr
 3  | astra3 | standby |   running     | astra2   | default  | host=astra3 dbname=repmgr_db user=repmgr_usr

WARNING: following issues were detected
  node "astra2" (ID: 2) is registered as an active primary but is unreachable
Press any key to continue...

After the switchover attempt, PostgreSQL is down on the primary node:

root@astra152:/etc/repmgr# service postgresql status
9.4/main (port 5432): down
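
The polling visible in the log above (repeated "node status --is-shutdown-cleanly" calls, each returning "--state=UNCLEAN_SHUTDOWN") can be modelled with a short sketch. This is an illustration of the observed behaviour, not repmgr's implementation: the function names are hypothetical, the pause after every failed check is an assumption, and so is the "SHUTDOWN" value used for a clean stop, since the log only ever shows "UNCLEAN_SHUTDOWN".

```python
# Assumption: "SHUTDOWN" is what a clean stop would report; the log in this
# report only contains "UNCLEAN_SHUTDOWN".
CLEAN_STATE = "SHUTDOWN"


def parse_optformat(output):
    """Turn '--status=OK --files=0' into {'status': 'OK', 'files': '0'}."""
    result = {}
    for token in output.split():
        if token.startswith("--") and "=" in token:
            key, value = token[2:].split("=", 1)
            result[key] = value
    return result


def wait_for_clean_shutdown(check, attempts, interval, sleep):
    """Call `check` up to `attempts` times.

    Returns the attempt number that saw a clean shutdown, or None if every
    attempt reported something else.
    """
    for attempt in range(1, attempts + 1):
        state = parse_optformat(check()).get("state")
        if state == CLEAN_STATE:
            return attempt
        sleep(interval)  # assumed pause between checks
    return None


# Replay the situation from the log: every check reports an unclean shutdown.
slept = []
outcome = wait_for_clean_shutdown(
    check=lambda: "--state=UNCLEAN_SHUTDOWN\n",
    attempts=6,
    interval=10,
    sleep=slept.append,  # record the pauses instead of really sleeping
)
print(outcome, len(slept))
```

Passing `check` and `sleep` in as parameters keeps the sketch runnable without SSH access or a real cluster.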

nucfisher commented 6 years ago

The same issue occurs on REL_4_0_STABLE (4.0.1).

JonathanDIT commented 6 years ago

I'm seeing the exact same problem with 4.0.1 (4.0.1-1.pgdg14.04+1 0 from apt.postgresql.org). Using switchover promotes the current standby to primary but kills the old primary, which is left marked as failed instead of being demoted to standby.

Host names redacted in the output below.

BEFORE:

 ID | Name                  | Role    | Status    | Upstream              | Location | Connection string
----+-----------------------+---------+-----------+-----------------------+----------+-------------------------------------------------------------------------------------
 1  | xxxxxxxxxxxxxxxxxxxx1 | standby |   running | xxxxxxxxxxxxxxxxxxxx2 | default  | host=xxx1 port=5432 user=replicator dbname=repmgr sslmode=require connect_timeout=5
 2  | xxxxxxxxxxxxxxxxxxxx2 | primary | * running |                       | default  | host=xxx2 port=5432 user=replicator dbname=repmgr sslmode=require connect_timeout=5
 3  | xxxxxxxxxxxxxxxxxxxx3 | standby |   running | xxxxxxxxxxxxxxxxxxxx2 | default  | host=xxx3 port=5432 user=replicator dbname=repmgr sslmode=require connect_timeout=5

AFTER:

 ID | Name                  | Role    | Status    | Upstream              | Location | Connection string
----+-----------------------+---------+-----------+-----------------------+----------+-------------------------------------------------------------------------------------
 1  | xxxxxxxxxxxxxxxxxxxx1 | primary | * running |                       | default  | host=xxx1 port=5432 user=replicator dbname=repmgr sslmode=require connect_timeout=5
 2  | xxxxxxxxxxxxxxxxxxxx2 | primary | - failed  |                       | default  | host=xxx2 port=5432 user=replicator dbname=repmgr sslmode=require connect_timeout=5
 3  | xxxxxxxxxxxxxxxxxxxx3 | standby |   running | xxxxxxxxxxxxxxxxxxxx1 | default  | host=xxx3 port=5432 user=replicator dbname=repmgr sslmode=require connect_timeout=5

Our fail-over testing procedure requires this to work in order to fail-back after a fail-over. It's another of several disappointing bugs in 4.x which has been forced upon us via the APT repo due to the removal of 3.x :-(

Once this has happened, my recovery procedure is to re-clone the broken primary as a standby and reregister it, and then restart repmgrd on all nodes as it otherwise continues to believe that the old primary is still the primary (I'll raise a separate bug for this).
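
The broken state described above, where a node is still registered as primary but is no longer running, can be spotted mechanically in "repmgr cluster show" output. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the parser relies only on the pipe-separated layout seen in this thread, and the sample abbreviates the redacted node names and connection strings from the AFTER output.

```python
def parse_cluster_show(text):
    """Parse the pipe-separated table printed by 'repmgr cluster show'."""
    header = None
    rows = []
    for line in text.splitlines():
        if "|" not in line:
            continue  # skips the ----+---- separator and any log lines
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells  # first pipe-separated line is the header
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def stale_primaries(rows):
    """Names of nodes registered as primary whose status is not '* running'."""
    return [
        row["Name"]
        for row in rows
        if row["Role"] == "primary" and row["Status"] != "* running"
    ]


# Abbreviated from the AFTER output above (names and conninfo shortened).
AFTER = """\
 ID | Name | Role    | Status    | Upstream | Location | Connection string
----+------+---------+-----------+----------+----------+------------------
 1  | xxx1 | primary | * running |          | default  | host=xxx1
 2  | xxx2 | primary | - failed  |          | default  | host=xxx2
 3  | xxx3 | standby |   running | xxx1     | default  | host=xxx3
"""

rows = parse_cluster_show(AFTER)
print(stale_primaries(rows))
```

A non-empty result is a hint that the old primary needs to be re-cloned and re-registered as described above.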

nucfisher commented 6 years ago

Of the issues assigned to 4.0.2, I would say that #354, #349 and #343 are P2 and #346 is P3. Looking forward to...

ibarwick commented 6 years ago

Issue now fixed; 4.0.2 release is scheduled for later next week.