EnterpriseDB / repmgr

A lightweight replication manager for PostgreSQL (Postgres)
https://repmgr.org/

Issue encountered while adding script for split-brain prevention #846

Open seunofk opened 7 months ago

seunofk commented 7 months ago

Hello.

When a network interface card (NIC) fails, I want a script run by repmgrd to detect the failure and take follow-up action.

If the primary DB loses its NIC connection for 10 seconds, I want the primary to be forcibly shut down and the standby DB to be promoted to take over.

However, although the standby is promoted, the primary DB does not stop.

Is the repmgrd daemon unable to detect a NIC disconnection?

  1. repmgr version: 5.3.3
  2. PostgreSQL version: 15.3

#!/bin/bash

PRIMARY_IP="10.12.30.191"
STANDBY_IP="10.12.30.192"
REPMGR_CONFIG="/postgres15/app/postgres/etc/repmgr.conf"
PGLOG="/pglog/repmgrd.log"

function echodate() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')]"
}

# Function to stop PostgreSQL on primary server
function stop_primary_db() {
    echo "$(echodate) [FAILOVER] Stopping primary PostgreSQL database" >> "$PGLOG"
    repmgr -f "$REPMGR_CONFIG" node service --action=stop
}

# Check if primary server needs to be shut down
ping -c 1 -W 10 "$PRIMARY_IP" > /dev/null 2>&1
ping_exit_code=$?
if [ $ping_exit_code -ne 0 ]; then
    # Ping to primary server timed out or failed, stop PostgreSQL and exit
    stop_primary_db
    exit 0
fi

# No failover condition met, exit
echo "$(echodate) [FAILOVER] No failover condition met, continuing normal operation" >> "$PGLOG"
exit 0
# =============================================================================
# Required configuration items
# =============================================================================
node_id=2
node_name='postgresdb192'
conninfo='host=postgresdb192 user=repmgr dbname=postgres connect_timeout=2'
data_directory='/postgres15/data'

#------------------------------------------------------------------------------
# Replication settings
#------------------------------------------------------------------------------
use_replication_slots=yes

#------------------------------------------------------------------------------
# Logging settings
#------------------------------------------------------------------------------
log_level=INFO
log_facility=STDERR
log_file='/pglog/repmgrd.log'

#------------------------------------------------------------------------------
# Environment/command settings
#------------------------------------------------------------------------------
pg_bindir='/postgres15/app/postgres/bin'

#------------------------------------------------------------------------------
# external command options
#------------------------------------------------------------------------------
pg_ctl_options='-s -l /dev/null'
ssh_options='-q -o ConnectTimeout=10'

#------------------------------------------------------------------------------
# Standby follow settings
#------------------------------------------------------------------------------
primary_follow_timeout=60

#------------------------------------------------------------------------------
# Failover and monitoring settings (repmgrd)
#------------------------------------------------------------------------------
failover=automatic
priority=100
reconnect_attempts=3
reconnect_interval=5
promote_command='repmgr standby promote -f /postgres15/app/postgres/etc/repmgr.conf --log-to-file'
follow_command='repmgr standby follow -f /postgres15/app/postgres/etc/repmgr.conf -W --upstream-node-id=%n --log-to-file'
monitoring_history=true
failover_validation_command='/postgres15/app/postgres/etc/failover.sh'
election_rerun_interval=10
#degraded_monitoring_timeout=-1
stephan-hahn commented 5 months ago

Hi, how do you execute your script? You could also use child_nodes_connected_min_count to manage more types of failures.
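
As far as I understand, failover_validation_command runs on the promotion candidate (here the standby), and `repmgr node service --action=stop` acts on the local node, so that script cannot stop a primary whose NIC is down. The child-node settings are evaluated by repmgrd on the primary itself. Below is a minimal sketch of a fencing script for that approach, not a tested solution. The script path, the `--fence` argument, and the `PROBE_*` variables are hypothetical names, and the repmgr.conf values in the comments are illustrative.

```shell
#!/bin/bash
# Sketch of a fencing script run ON THE PRIMARY by repmgrd.
# Assumed repmgr.conf entries on the primary (illustrative values):
#   child_nodes_connected_min_count=1
#   child_nodes_disconnect_timeout=10
#   child_nodes_disconnect_command='/path/to/fence_primary.sh --fence'

REPMGR_CONFIG="${REPMGR_CONFIG:-/postgres15/app/postgres/etc/repmgr.conf}"
PGLOG="${PGLOG:-/pglog/repmgrd.log}"
# Address to probe; defaults to the standby IP from the issue. A gateway or
# witness address (not given in the issue) may be a better reference point.
PROBE_TARGET="${PROBE_TARGET:-10.12.30.192}"
PROBE_ATTEMPTS="${PROBE_ATTEMPTS:-3}"   # hypothetical tuning knob
PROBE_INTERVAL="${PROBE_INTERVAL:-1}"   # seconds between probes (hypothetical)

log_msg() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] [FENCE] $*" >> "$PGLOG"
}

# Succeeds if the probe target answers a single ping within one second.
probe_target() {
    ping -c 1 -W 1 "$PROBE_TARGET" > /dev/null 2>&1
}

# Stops the local PostgreSQL instance, using the same command as the issue script.
stop_local_postgres() {
    repmgr -f "$REPMGR_CONFIG" node service --action=stop
}

# Returns 0 only if every probe fails, i.e. this node appears isolated.
target_unreachable() {
    attempt=1
    while [ "$attempt" -le "$PROBE_ATTEMPTS" ]; do
        if probe_target; then
            return 1
        fi
        if [ "$attempt" -lt "$PROBE_ATTEMPTS" ]; then
            sleep "$PROBE_INTERVAL"
        fi
        attempt=$((attempt + 1))
    done
    return 0
}

fence_if_isolated() {
    if target_unreachable; then
        log_msg "Probe target unreachable, stopping local PostgreSQL"
        stop_local_postgres
    else
        log_msg "Probe target reachable, leaving PostgreSQL running"
    fi
}

# Only act when explicitly asked, so the file can also be sourced safely.
if [ "${1:-}" = "--fence" ]; then
    fence_if_isolated
fi
```

With only one standby, `child_nodes_connected_min_count=1` would also trigger the command when the standby itself goes down, which is why the sketch re-checks reachability before stopping anything. Whether that trade-off is acceptable depends on your setup.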

Stephan