EnterpriseDB / repmgr

A lightweight replication manager for PostgreSQL (Postgres)
https://repmgr.org/
Other
1.58k stars 252 forks source link

old primary node rejoin after failover uncompatitable with pg_rewind #829

Open Zhaoxun opened 1 year ago

Zhaoxun commented 1 year ago

First I create role 'rep' for repmgr to use like these:

CREATE USER rep replication;
CREATE database repmgr WITH OWNER rep;
CREATE EXTENSION repmgr;
 GRANT pg_checkpoint TO rep;
 GRANT pg_read_all_stats TO rep;
GRANT EXECUTE ON function pg_catalog.pg_ls_dir(text, boolean, boolean) TO rep;
GRANT EXECUTE ON function pg_catalog.pg_stat_file(text, boolean) TO rep;
GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text) TO rep;
GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text, bigint, bigint, boolean) TO rep;

So 'rep' is the least privileged user according to repmgr's official document and postgresql's pg_rewind documentation.

However, after encountering the strange error occurred on both pg15 and pg16 like these:

repmgr node rejoin --force-rewind -d 'host=192.168.1.2 port=5432 user=rep dbname=repmgr connect_timeout=5'
NOTICE: rejoin target is node "yzx2" (ID: 2)
NOTICE: pg_rewind execution required for this node to attach to rejoin target node 2
DETAIL: rejoin target server's timeline 2 forked off current database system timeline 1 before current recovery point 0/81C6D80
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "/usr/pgsql-15/bin/pg_rewind -D '/pgdata' --source-server='host=192.168.1.2 port=5432 user=rep dbname=repmgr connect_timeout=5'"
ERROR: pg_rewind execution failed
DETAIL: pg_rewind: error: could not fetch remote file "global/pg_control": ERROR:  permission denied for function pg_read_binary_file

https://serverfault.com/questions/1144242/

I re-checked the documentation for pg_rewind again, and found a line:

-source-server=connstr Specifies a libpq connection string to connect to the source PostgreSQL server to synchronize the target with. The connection must be a normal (non-replication) connection with a role having sufficient permissions to execute the functions used by pg_rewind on the source server (see Notes section for details) or a superuser role.

So I guess it is the 'replication role' that caused the issue, thus ALTER ROLE rep NOREPLICATION;

But now the rewind encounters another problem - it won't process unless it is a replication role!

 repmgr node rejoin --force-rewind -d 'host=172.17.1.2 port=5432 user=rep dbname=repmgr connect_timeout=5'
NOTICE: rejoin target is node "yzx2" (ID: 2)
ERROR: connection to database failed
DETAIL:
connection to server at "172.17.1.2", port 5432 failed: FATAL:  must be superuser or replication role to start walsender

ERROR: unable to establish a replication connection to the rejoin target node

So either way it does not work! repmgr =the latest version 5.4dev on both pg15 and pg16!

Zhaoxun commented 1 year ago

BTW , I tested pure pg_rewind with noreplication role rep and it worked:

pg_rewind -D '/pgdata' --source-server='host=172.17.1.2 port=5432 user=rep dbname=postgres connect_timeout=5'
pg_rewind: servers diverged at WAL location 0/81C4E60 on timeline 1
pg_rewind: rewinding from last common checkpoint at 0/81C4DE8 on timeline 1
pg_rewind: Done!

So for sheer pg_rewind it does not need a replication role, and on the contrary, it needs a noreplication role. But repmgr node rejoin requires to execute pg_rewind as well as using a replication role! Thus it would fail no matter the role is replication or noreplication!

stephan-hahn commented 7 months ago

Altering the user from replication to noreplication doesn't fix the problem of missing execution permissions. The easy way is to make rep a superuser.