EnterpriseDB / repmgr

A lightweight replication manager for PostgreSQL (Postgres)
https://repmgr.org/

repmgr daemon status showing repmgrd as 'not running' #854

Open mRx-z3d opened 3 months ago

mRx-z3d commented 3 months ago

Hi,

I'm playing with AlloyDB Omni, which is a standard PGSQL wrapped in a container and packed with some GCP (Google) steroids. Everything is working well, I was able to build a simple config with Primary and a single Standby. I was also able to use repmgr to test the switchover and switchback operations - this also works fine. The problem starts when I try to use repmgr with automatic failover:

Versions:

repmgr --version
repmgr 5.4.1

postgres --version
postgres (PostgreSQL) 15.5

Configuration:

A) repmgrd content (/etc/default/repmgrd):
REPMGRD_ENABLED=yes
REPMGRD_CONF="/var/alloydb/config/repmgr.conf"
REPMGRD_OPTS="--daemonize=false"
REPMGRD_USER=postgres
REPMGRD_BIN=/usr/bin/repmgrd
REPMGRD_PIDFILE=/var/run/repmgrd.pid

B) repmgr configuration (/var/alloydb/config/repmgr.conf):
failover=automatic
promote_command='/usr/bin/repmgr standby promote -f /var/alloydb/config/repmgr.conf --log-to-file'
follow_command='/usr/bin/repmgr standby follow -f /var/alloydb/config/repmgr.conf --log-to-file --upstream-node-id=%n'
repmgrd_service_start_command='sudo /usr/bin/systemctl start repmgrd'
repmgrd_service_start_command='sudo /usr/bin/systemctl stop repmgrd'
monitoring_history=yes
log_level=INFO
log_file='/var/log/postgres/repmgrd.log'

Symptoms: I am able to start the repmgrd service on both nodes:

on prim:
repmgr -f /var/alloydb/config/repmgr.conf daemon start --verbose
NOTICE: using provided configuration file "/var/alloydb/config/repmgr.conf"
INFO: connecting to local node
NOTICE: executing: "sudo /usr/bin/systemctl start repmgrd"
NOTICE: repmgrd was successfully started

prim output:
● repmgrd.service - LSB: Start/stop repmgrd
     Loaded: loaded (/etc/init.d/repmgrd; generated)
     Active: active (running) since Mon 2024-06-24 04:24:39 EDT; 16min ago
       Docs: man:systemd-sysv-generator(8)
    Process: 10531 ExecStart=/etc/init.d/repmgrd start (code=exited, status=0/SUCCESS)
      Tasks: 1 (limit: 19151)
     Memory: 1.3M
        CPU: 532ms
     CGroup: /system.slice/repmgrd.service
             └─10536 /usr/lib/postgresql/15/bin/repmgrd --config-file /var/alloydb/config/repmgr.conf --daemonize=false

Jun 24 04:24:39 omnidbv-repli-03 systemd[1]: Starting LSB: Start/stop repmgrd...
Jun 24 04:24:39 omnidbv-repli-03 repmgrd[10531]: Starting PostgreSQL replication management and monitoring daemon: repmgrd.
Jun 24 04:24:39 omnidbv-repli-03 systemd[1]: Started LSB: Start/stop repmgrd.

on stby:
repmgr -f /var/alloydb/config/repmgr.conf daemon start --verbose
NOTICE: using provided configuration file "/var/alloydb/config/repmgr.conf"
INFO: connecting to local node
NOTICE: executing: "sudo /usr/bin/systemctl start repmgrd"
NOTICE: repmgrd was successfully started

stby output:
● repmgrd.service - LSB: Start/stop repmgrd
     Loaded: loaded (/etc/init.d/repmgrd; generated)
     Active: active (running) since Mon 2024-06-24 04:24:39 EDT; 17min ago
       Docs: man:systemd-sysv-generator(8)
    Process: 10531 ExecStart=/etc/init.d/repmgrd start (code=exited, status=0/SUCCESS)
      Tasks: 1 (limit: 19151)
     Memory: 1.3M
        CPU: 567ms
     CGroup: /system.slice/repmgrd.service
             └─10536 /usr/lib/postgresql/15/bin/repmgrd --config-file /var/alloydb/config/repmgr.conf --daemonize=false

Jun 24 04:24:39 omnidbv-repli-03 systemd[1]: Starting LSB: Start/stop repmgrd...
Jun 24 04:24:39 omnidbv-repli-03 repmgrd[10531]: Starting PostgreSQL replication management and monitoring daemon: repmgrd.
Jun 24 04:24:39 omnidbv-repli-03 systemd[1]: Started LSB: Start/stop repmgrd.

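The repmgr extension is installed on both nodes:

repmgr=# SELECT * FROM pg_extension;
  oid  |        extname         | extowner | extnamespace | extrelocatable | extversion |      extconfig      | extcondition
-------+------------------------+----------+--------------+----------------+------------+---------------------+--------------
 14204 | plpgsql                |       10 |           11 | f              | 1.0        |                     |
 99377 | google_columnar_engine |       10 |         2200 | t              | 1.0        |                     |
 99567 | google_db_advisor      |       10 |         2200 | t              | 1.0        |                     |
 99661 | hypopg                 |       10 |         2200 | t              | 1.3.2      |                     |
 50059 | repmgr                 |    47598 |        50058 | f              | 5.4        | {50060,50076,50083} | {"","",""}
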
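Both repmgr service status and repmgr daemon status show the repmgrd PIDs, but they report repmgrd as 'not running':

 ID | Name          | Role    | Status    | Upstream      | repmgrd     | PID   | Paused? | Upstream last seen
----+---------------+---------+-----------+---------------+-------------+-------+---------+--------------------
 1  | omnidbv-03-n1 | primary | * running |               | not running | 52598 | no      | n/a
 2  | omnidbv-03-n2 | standby |   running | omnidbv-03-n1 | not running | 10536 | no      | 0 second(s) ago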

Any idea why this is happening? What checks does repmgr perform to determine the daemon status (besides the repmgrd_is_running function)? I would appreciate any help with debugging. Also, why does the log file report set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid, rather than the configured REPMGRD_PIDFILE=/var/run/repmgrd.pid?
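
For anyone trying to reproduce this, here is a minimal sketch of a manual cross-check for the two pidfile paths mentioned above. It assumes (and this is only an assumption, not something confirmed from the repmgr source) that the 'running' check comes down to "the pidfile is readable and the PID in it answers a signal-0 probe". The function name check_pidfile is hypothetical and not part of repmgr.

```shell
#!/bin/sh
# Hypothetical helper, not part of repmgr: report what a pidfile contains and
# whether that PID can be probed from the place where this script runs.
check_pidfile() {
    pidfile=$1
    if [ ! -r "$pidfile" ]; then
        echo "$pidfile: missing or unreadable"
        return 1
    fi
    # Read only the first line and strip whitespace around the number.
    pid=$(head -n 1 "$pidfile" | tr -d '[:space:]')
    case $pid in
        ''|*[!0-9]*)
            echo "$pidfile: no numeric PID"
            return 1
            ;;
    esac
    # kill -0 delivers no signal; it only tests that the PID exists and that
    # the current user is allowed to signal it.
    if kill -0 "$pid" 2>/dev/null; then
        echo "$pidfile: PID $pid is alive here"
    else
        echo "$pidfile: PID $pid is not visible here"
        return 1
    fi
}
```

Running check_pidfile /tmp/repmgrd.pid and check_pidfile /var/run/repmgrd.pid once on the host and once inside the AlloyDB Omni container, as the postgres user, should show whether the two sides see the same file and the same PID. If the results differ, a container filesystem or PID namespace boundary would be one possible explanation for the 'not running' status, though that is a guess on my part.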

mRx-z3d commented 3 months ago

@ibarwick any chance you could look into this? Many thanks.