EnterpriseDB / repmgr

A lightweight replication manager for PostgreSQL (Postgres)
https://repmgr.org/

Repmgr fails to attach my replica to the current master #824

Closed NielsKSchjoedt closed 1 year ago

NielsKSchjoedt commented 1 year ago

I have a cluster of PostgreSQL databases and a Barman server for backups, managed with repmgr. I'm in the process of attaching a new replica. The replica is currently cloning from Barman and is still catching up on replication lag (about 1 day behind at the moment). I would now like to attach it to the master, but I am running into some issues.

Normally, after cloning with Barman, I would run:

postgres@psql-09:~$ repmgr standby follow --upstream-node-id 7
ERROR: unable to retrieve record for local node 9
postgres@psql-09:~$ repmgr standby register
INFO: connecting to local node "psql-09" (ID: 9)
INFO: connecting to primary database
ERROR: node 9 is already registered
HINT: use option -F/--force to overwrite an existing node record

But as you can see, node 9 is somehow already registered. When I query the repmgr database on any Postgres server other than node 9:

repmgr=# select node_id from repmgr.nodes;
 node_id
---------
       6
       7
       5
       8
       9
(5 rows)

I can see that node 9 is there. When I run the same query on node 9, however:

repmgr=#  select node_id from repmgr.nodes;
 node_id
---------
       6
       7
       5
       8
(4 rows)

every node is listed except node 9.

I can force the registration, as suggested:

postgres@psql-09:~$ repmgr -d repmgr standby register -F
INFO: connecting to local node "psql-09" (ID: 9)
WARNING: database connection parameters not required when the standby to be registered is running
DETAIL: repmgr uses the "conninfo" parameter in "repmgr.conf" to connect to the standby
INFO: connecting to primary database
INFO: standby registration complete
NOTICE: standby node "psql-09" (ID: 9) successfully registered

But it doesn’t change anything. I also tried to run standby follow:

postgres@psql-09:~$ repmgr standby follow
ERROR: unable to retrieve record for local node 9

My next idea is to simply remove these records and try again. I also tried to clone again with only the replication configuration:

postgres@psql-09:~$ repmgr -d repmgr standby clone --replication-conf-only -F
NOTICE: destination directory "/var/lib/postgresql/15/main" provided
WARNING: creating replication configuration in an active data directory
ERROR: unable to retrieve node record for local node 9
HINT: standby must be registered before replication can be configured

But that also requires a record for node 9.

How can I fix this problem?

PS: We are running repmgr version 5.3.3 and PostgreSQL 15.1.

NielsKSchjoedt commented 1 year ago

It turned out that it worked once the replication lag was smaller.
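
A minimal sketch of how this could be checked before retrying. It assumes the cause was that the node 9 record, written on the primary, had not yet been replayed on the lagging replica. That is consistent with the query results above but is not confirmed in this thread. The function names node_record_visible and replay_delay are hypothetical. The queries use the repmgr.nodes table shown above and the standard PostgreSQL function pg_last_xact_replay_timestamp(), and assume psql can connect to the local repmgr database as the current user.

```shell
# Hypothetical helpers, meant to be run as the postgres user on the replica.

# Succeeds once the local copy of repmgr.nodes contains the given node ID,
# i.e. once the registration made on the primary is visible locally.
node_record_visible() {
    node_id=$1
    count=$(psql -d repmgr -Atc "SELECT count(*) FROM repmgr.nodes WHERE node_id = ${node_id}")
    [ "$count" = "1" ]
}

# Prints the time since the last replayed transaction was committed
# (a rough lag estimate; it overstates lag when the primary is idle).
replay_delay() {
    psql -d repmgr -Atc "SELECT now() - pg_last_xact_replay_timestamp()"
}
```

With these loaded in the shell, running "node_record_visible 9 && repmgr standby follow --upstream-node-id 7" would attempt the follow only once the record is present locally.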