EnterpriseDB / repmgr

A lightweight replication manager for PostgreSQL (Postgres)
https://repmgr.org/

switchover fails on LSN #703

Open piotrekfus91 opened 3 years ago

piotrekfus91 commented 3 years ago

Hi, I am trying to do a switchover using repmgr. It stops the primary node correctly, but after that it hangs for 30 seconds waiting for WAL and then aborts:

postgres@8feb0787ba67:~$ repmgr -f /etc/postgresql/13/main/repmgr.conf standby switchover
NOTICE: executing switchover on node "db2" (ID: 2)
NOTICE: local node "db2" (ID: 2) will be promoted to primary; current primary "db1" (ID: 1) will be demoted to standby
NOTICE: stopping current primary node "db1" (ID: 1)
NOTICE: issuing CHECKPOINT on node "db1" (ID: 1)
DETAIL: executing server command "/usr/lib/postgresql/13/bin/pg_ctl  -D /var/lib/postgresql/13/main -W -m fast stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/4000028
NOTICE: waiting up to 30 seconds (parameter "wal_receive_check_timeout") for received WAL to flush to disk
INFO: sleeping 1 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 2 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 3 of maximum 30 seconds waiting for standby to flush received WAL to disk
[...]
INFO: sleeping 29 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 30 of maximum 30 seconds waiting for standby to flush received WAL to disk
WARNING: local node "db2" is behind shutdown primary "db1"
DETAIL: local node last receive LSN is 0/3D04000, primary shutdown checkpoint LSN is 0/4000028
NOTICE: aborting switchover
HINT: use --always-promote to force promotion of standby
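
To put the DETAIL line above in perspective, the distance between the two LSNs can be computed by hand. The following is a minimal bash sketch, assuming the usual PostgreSQL LSN notation of two hexadecimal numbers separated by a slash (high 32 bits, then low 32 bits); lsn_to_bytes and lsn_gap are hypothetical helper names, not part of repmgr.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not part of repmgr or PostgreSQL.

# Convert an LSN of the form "HI/LO" (two hex numbers) to a byte position.
lsn_to_bytes() {
  local hi=${1%/*} lo=${1#*/}
  echo $(( (0x$hi << 32) + 0x$lo ))
}

# Number of bytes by which the second LSN trails the first.
lsn_gap() {
  echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

# Primary shutdown checkpoint LSN vs. standby last receive LSN from the log above.
lsn_gap 0/4000028 0/3D04000   # prints 3129384
```

So the standby was roughly 3 MB of WAL short of the shutdown checkpoint when the wait timed out. On a running server the same figure should be obtainable with SELECT pg_wal_lsn_diff('0/4000028', '0/3D04000').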

I tried with --force-rewind=/usr/lib/postgresql/13/bin/pg_rewind; the result is the same. I also created a symlink with sudo ln -s /usr/lib/postgresql/13/bin/pg_rewind /usr/bin/pg_rewind, but to no avail.

Environment: repmgr 5.2.0, PostgreSQL 13, Ubuntu 20.04 (on Docker).

postgresql.override.conf:

listen_addresses = '*'
max_wal_senders = 10
max_replication_slots = 10
wal_level = 'replica'
hot_standby = on
archive_mode = on
archive_command = '/bin/true'
shared_preload_libraries = 'repmgr'
wal_log_hints = on

repmgr.conf:

node_id=2
node_name=db2
conninfo='host=db2 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr'
data_directory='/var/lib/postgresql/13/main'
failover=automatic
promote_command='repmgr standby promote -f /etc/postgresql/13/main/repmgr.conf --log-to-file'
follow_command='repmgr standby follow -f /etc/postgresql/13/main/repmgr.conf --log-to-file --upstream-node-id=%n'
service_start_command='/usr/lib/postgresql/13/bin/pg_ctl  -D /var/lib/postgresql/13/main -W -m fast start'
service_stop_command='/usr/lib/postgresql/13/bin/pg_ctl  -D /var/lib/postgresql/13/main -W -m fast stop'
service_restart_command='/usr/lib/postgresql/13/bin/pg_ctl  -D /var/lib/postgresql/13/main -W -m fast restart'

Any hints, how to solve this problem?

sandrobordacchini commented 2 years ago

Hi, I have the same issue here with:

(no Docker involved, just plain VMs)

Did you work out a solution? Thanks.

piotrekfus91 commented 2 years ago

I didn't. After half a year without an answer, we plan to replace repmgr with something else.

alien11689 commented 2 years ago

We had the same problem with WAL on PostgreSQL 13 and repmgr 5.3. It happens when the timeline differs between the nodes:

node1$ repmgr -v -f /etc/postgresql/13/main/repmgr.conf cluster show
NOTICE: using provided configuration file "/etc/postgresql/13/main/repmgr.conf"
INFO: connecting to database
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                          
----+---------+---------+-----------+----------+----------+----------+----------+------------------
 1  | node1   | standby |   running | node2    | default  | 100      | 15       | ...
 2  | node2   | primary | * running |          | default  | 100      | 16       | ...
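
A check along these lines could flag the mismatch before a switchover is attempted. It is only a sketch: it assumes the table layout shown above, with Timeline as the eighth pipe-separated column, and timelines_match is a hypothetical helper name rather than a repmgr command.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads "repmgr cluster show" table output on stdin and
# succeeds (exit 0) only if every node row reports the same timeline.
timelines_match() {
  awk -F'|' '
    NF >= 9 && $1 ~ /^ *[0-9]+ *$/ {   # data rows start with a numeric node ID
      gsub(/ /, "", $8)                # column 8 is "Timeline" in the layout above
      seen[$8] = 1
    }
    END {
      n = 0
      for (t in seen) n++
      exit (n == 1 ? 0 : 1)            # exactly one distinct timeline means a match
    }'
}

# Example use (not executed here):
#   repmgr -f /etc/postgresql/13/main/repmgr.conf cluster show | timelines_match \
#     || echo "timelines differ between nodes"
```

Because the function parses human-readable output, it would need adjusting if the column order differs in another repmgr version.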

You can restart the standby node:

node1$ sudo systemctl restart postgresql

and the timeline will then be equal on both nodes:

node1$ repmgr -v -f /etc/postgresql/13/main/repmgr.conf cluster show
NOTICE: using provided configuration file "/etc/postgresql/13/main/repmgr.conf"
INFO: connecting to database
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+------------------
 1  | node1   | standby |   running | node2    | default  | 100      | 16       | ...
 2  | node2   | primary | * running |          | default  | 100      | 16       | ...

The next switchover operation should then succeed:

node1$ repmgr -v -f /etc/postgresql/13/main/repmgr.conf standby switchover
NOTICE: using provided configuration file "/etc/postgresql/13/main/repmgr.conf"
NOTICE: executing switchover on node "node1" (ID: 1)
INFO: searching for primary node
INFO: checking if node 2 is primary
INFO: current primary node is 2
INFO: SSH connection to host "node2" succeeded
INFO: 0 pending archive files
INFO: replication lag on this standby is 0 seconds
NOTICE: attempting to pause repmgrd on 2 nodes
NOTICE: local node "node1" (ID: 1) will be promoted to primary; current primary "node2" (ID: 2) will be demoted to standby
NOTICE: stopping current primary node "node2" (ID: 2)
NOTICE: issuing CHECKPOINT on node "node2" (ID: 2) 
DETAIL: executing server command "sudo /usr/bin/systemctl stop postgresql"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 1/A8000028
NOTICE: promoting standby to primary
DETAIL: promoting server "node1" (ID: 1) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
INFO: standby promoted to primary after 1 second(s)
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node1" (ID: 1) was successfully promoted to primary
INFO: node "node2" (ID: 2) is pingable
INFO: node "node2" (ID: 2) has attached to its upstream node
NOTICE: node "node1" (ID: 1) promoted to primary, node "node2" (ID: 2) demoted to standby
NOTICE: switchover was successful
DETAIL: node "node1" is now primary and node "node2" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully

Result:

node1$ repmgr -v -f /etc/postgresql/13/main/repmgr.conf cluster show
NOTICE: using provided configuration file "/etc/postgresql/13/main/repmgr.conf"
INFO: connecting to database
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                          
----+---------+---------+-----------+----------+----------+----------+----------+-------------------
 1  | node1   | primary | * running |          | default  | 100      | 17       | ...
 2  | node2   | standby |   running | node1    | default  | 100      | 16       | ...

EamonZhang commented 2 years ago

@alien11689

I had the same problem, which could be solved by restarting the standby server or waiting a few minutes.

vyegorov commented 2 years ago

I hit the same issue.

The main reason here is the dummy archive_command; if you disable archiving, things work as expected.

To fix it, just set archive_command = '{ sleep 5; true; }'. A shorter sleep might work as well. I am not sure whether this is a repmgr issue or a race inside PostgreSQL, though.
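
One possible way to apply this setting without editing the configuration file by hand is ALTER SYSTEM followed by a configuration reload. The bash sketch below assumes psql can connect as a superuser using its default connection settings; apply_archive_workaround and the PSQL override variable are hypothetical names introduced for illustration. Note that ALTER SYSTEM writes to postgresql.auto.conf, which takes precedence over values set in postgresql.conf and its include files.

```bash
#!/usr/bin/env bash
# Hypothetical helper: sets the delayed no-op archive_command suggested above
# and reloads the server configuration. Setting PSQL=echo gives a dry run that
# only prints the commands that would be sent.
apply_archive_workaround() {
  local psql=${PSQL:-psql}
  "$psql" -X -c "ALTER SYSTEM SET archive_command = '{ sleep 5; true; }'" || return 1
  "$psql" -X -c "SELECT pg_reload_conf()"
}

# Example use (not executed here):
#   PSQL=echo apply_archive_workaround   # dry run
#   apply_archive_workaround             # apply on the connected server
```

Running SHOW archive_command afterwards should confirm whether the new value is active.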

fonya commented 2 years ago

@vyegorov Thank you very much for your answer, that is the solution: archive_command = '{ sleep 5; true; }'

likingzi commented 2 years ago

@vyegorov Thank you! Your reply also solved the same issue for me.