EnterpriseDB / repmgr

A lightweight replication manager for PostgreSQL (Postgres)
https://repmgr.org/

Standby resyncs with Primary Node at every Restart #858

Open aviralsingh21 opened 2 months ago

aviralsingh21 commented 2 months ago

I have a Docker Swarm HA architecture with three PostgreSQL nodes, one Pgpool-II service, and various other services. PostgreSQL is set up as an HA cluster using the Replication Manager (repmgr) tool: 1 primary node + 1 standby node + 1 witness node.

Docker Image Used: bitnami/postgresql-repmgr:16.3.0

Issue: The standby resyncs with the primary node at every restart of the Docker services.

My plan was to shut down the PostgreSQL database gracefully and then stop the container. However, as soon as the database on the primary node (node-1) shut down, the container exited, and the database came back up under a new container ID with the standby role and started to resync with the new primary (node-2). I assumed this was normal. Since the container restarted on every attempt to shut down the database, I thought it would be better to stop the repmgr daemon first so that the database would stay stopped. That did not help.

I have not found a reliable way to shut down the database gracefully before stopping the PostgreSQL Docker service. While looking for one, I discovered another issue: whenever I restart the PostgreSQL Docker service, the standby node (node-1) resyncs (performs a full clone) with the primary node (node-2) every single time.

PostgreSQL Logs from Standby Node:

postgresql-repmgr 06:13:00.65 INFO  ==>
postgresql-repmgr 06:13:00.67 INFO  ==> Welcome to the Bitnami postgresql-repmgr container
postgresql-repmgr 06:13:00.67 INFO  ==> Subscribe to project updates by watching https://github.com/bitnami/containers
postgresql-repmgr 06:13:00.67 INFO  ==> Submit issues and feature requests at https://github.com/bitnami/containers/issues
postgresql-repmgr 06:13:00.67 INFO  ==> Upgrade to Tanzu Application Catalog for production environments to access custom-configured and pre-packaged software components. Gain enhanced features, including Software Bill of Materials (SBOM), CVE scan result reports, and VEX documents. To learn more, visit https://bitnami.com/enterprise
postgresql-repmgr 06:13:00.67 INFO  ==>
postgresql-repmgr 06:13:00.69 INFO  ==> ** Starting PostgreSQL with Replication Manager setup **
postgresql-repmgr 06:13:00.72 INFO  ==> Validating settings in REPMGR_* env vars...
postgresql-repmgr 06:13:00.72 INFO  ==> Validating settings in POSTGRESQL_* env vars..
postgresql-repmgr 06:13:00.72 INFO  ==> Querying all partner nodes for common upstream node...
postgresql-repmgr 06:13:00.84 INFO  ==> Auto-detected primary node: 'pg-0:5432'
postgresql-repmgr 06:13:00.84 INFO  ==> Node configured as standby
postgresql-repmgr 06:13:00.85 INFO  ==> Preparing PostgreSQL configuration...
postgresql-repmgr 06:13:00.85 INFO  ==> postgresql.conf file not detected. Generating it...
postgresql-repmgr 06:13:00.90 INFO  ==> Preparing repmgr configuration...
postgresql-repmgr 06:13:00.91 INFO  ==> Initializing Repmgr...
postgresql-repmgr 06:13:00.91 INFO  ==> Waiting for primary node...
postgresql-repmgr 06:13:00.93 INFO  ==> Rejoining node...
postgresql-repmgr 06:13:00.93 INFO  ==> Cloning data from primary node...
postgresql-repmgr 06:20:14.33 INFO  ==> Initializing PostgreSQL database...
postgresql-repmgr 06:20:14.35 INFO  ==> Custom configuration /opt/bitnami/postgresql/conf/postgresql.conf detected
postgresql-repmgr 06:20:14.35 INFO  ==> Custom configuration /opt/bitnami/postgresql/conf/pg_hba.conf detected
postgresql-repmgr 06:20:14.39 INFO  ==> Deploying PostgreSQL with persisted data...
postgresql-repmgr 06:20:14.45 INFO  ==> Configuring replication parameters
postgresql-repmgr 06:20:14.48 INFO  ==> Configuring fsync
postgresql-repmgr 06:20:14.49 INFO  ==> Setting up streaming replication slave...
postgresql-repmgr 06:20:14.51 INFO  ==> Starting PostgreSQL in background...
postgresql-repmgr 06:20:17.49 INFO  ==> Unregistering standby node...
postgresql-repmgr 06:20:17.62 INFO  ==> Registering Standby node...
postgresql-repmgr 06:20:17.72 INFO  ==> Stopping PostgreSQL...
waiting for server to shut down.... done
server stopped
postgresql-repmgr 06:20:17.84 INFO  ==> ** PostgreSQL with Replication Manager setup finished! **

postgresql-repmgr 06:20:17.89 INFO  ==> Starting PostgreSQL in background...

I also compared these logs with those of a standby node in another, identical environment that does not have this issue. The logs are the same as above, except that the 'Rejoining node...' line does not appear there.

Additional information: I have already reviewed other relevant issues, such as #52213 and #34986. I configured pg_rewind and enabled wal_log_hints, but the situation is unchanged. I also tested with the bitnami/postgresql-repmgr:12.4.0 Docker image, with the same result. I then deleted the volume, deployed the PostgreSQL service with a fresh volume, and restored the database. This time I stopped the Docker service directly instead of stopping the database first, but I still face the same issue. Size of the database used for testing: around 60 GB.
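For anyone reproducing this, here is a minimal sketch of how to confirm that the pg_rewind prerequisites are really active on a node. pg_rewind requires full_page_writes to be on, plus either wal_log_hints = on or data checksums enabled. The function name `rewind_ready` is hypothetical, and the psql connection options in the usage comment are assumptions that depend on your setup. Passing this check does not by itself explain or prevent the re-clone at container start.

```shell
#!/bin/sh
# Sketch: check the settings pg_rewind depends on.
# pg_rewind needs full_page_writes = on, plus either wal_log_hints = on
# or data checksums enabled on the cluster.
# rewind_ready is a hypothetical helper name, not part of repmgr or Bitnami.
rewind_ready() {
    wal_log_hints="$1"
    data_checksums="$2"
    full_page_writes="$3"
    # full_page_writes is mandatory in every case.
    [ "$full_page_writes" = "on" ] || return 1
    # One of the two page-tracking mechanisms must be on.
    [ "$wal_log_hints" = "on" ] || [ "$data_checksums" = "on" ]
}

# Usage against a running server (add -h/-U/-d options as needed):
#   if rewind_ready "$(psql -XAtc 'SHOW wal_log_hints')" \
#                   "$(psql -XAtc 'SHOW data_checksums')" \
#                   "$(psql -XAtc 'SHOW full_page_writes')"; then
#       echo "pg_rewind prerequisites met"
#   else
#       echo "pg_rewind prerequisites NOT met"
#   fi
```

Run the check on both the primary and the standby, because wal_log_hints only takes effect after a server restart.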

[Image: REPMGR Cluster]

How can I tackle this? Can anyone please help me with this situation?

JP95Git commented 2 days ago

I had a similar problem. When I reboot my primary, e.g. to update the Linux kernel, the secondary is promoted to primary. To "fix" this, I pause the service before the reboot and unpause it after the reboot.

Pause service (execute on ONE node of the cluster): /path/to/binary/repmgr --config-file=/path/to/config/repmgr.conf service pause

Unpause/continue service (execute on ONE node of the cluster): /path/to/binary/repmgr --config-file=/path/to/config/repmgr.conf service unpause
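The two commands above can be wrapped in a small helper so the same paths are used for both. This is only a sketch: `REPMGR_BIN`, `REPMGR_CONF`, `DRY_RUN`, and the function name `repmgr_service` are hypothetical, and the default paths are the same placeholders as above. Pausing only stops repmgrd from triggering automatic failover. Whether it also prevents the re-clone at container start described in the original post is not established in this thread.

```shell
#!/bin/sh
# Sketch: pause repmgrd failover handling around a planned restart.
# REPMGR_BIN and REPMGR_CONF are placeholders; set them to your real paths.
REPMGR_BIN="${REPMGR_BIN:-/path/to/binary/repmgr}"
REPMGR_CONF="${REPMGR_CONF:-/path/to/config/repmgr.conf}"

# Run one "repmgr service <action>" call (pause, unpause, status).
# With DRY_RUN=1 the command line is only printed, not executed.
repmgr_service() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$REPMGR_BIN --config-file=$REPMGR_CONF service $1"
    else
        "$REPMGR_BIN" --config-file="$REPMGR_CONF" service "$1"
    fi
}

# Before the planned restart (run on ONE node of the cluster):
#   repmgr_service pause
#   repmgr_service status    # check that repmgrd is reported as paused
# After the node is back up:
#   repmgr_service unpause
```

Setting `DRY_RUN=1` first prints the exact command line, which is a cheap way to verify the binary and config paths before touching a live cluster.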