EnterpriseDB / repmgr

A lightweight replication manager for PostgreSQL (Postgres)
https://repmgr.org/

repmgr: it automatically promotes a new master, but the other standby stopped #667

Open m-jayson opened 4 years ago

m-jayson commented 4 years ago

I have an issue which I also posted on DBA Stack Exchange: https://dba.stackexchange.com/questions/276557/repmgr-it-automatically-promotes-to-new-master-but-other-standby-stopped

I would also like to understand what happened.

I tested whether my automatic failover works, and it did: I terminated my primary container, and my secondary container got promoted. Unfortunately, my third container stopped. Here is the log: (screenshot)

I'm running the official postgres Docker image, version 10, and here is how I generate my repmgr.conf:

NET_IF=`netstat -rn | awk '/^0.0.0.0/ {thif=substr($0,74,10); print thif;} /^default.*UG/ {thif=substr($0,65,10); print thif;}'`
NET_IP=`ifconfig ${NET_IF} | grep -Eo 'inet (addr:)?([0-9]*\.){3}[0-9]*' | grep -Eo '([0-9]*\.){3}[0-9]*' | grep -v '127.0.0.1'` 

HOSTNAME='postgres-'${my_node}

cat<<EOF > /etc/repmgr.conf
    node_id=${my_node}
    node_name=$HOSTNAME
    conninfo='host=${NET_IP} user=repmgr password=repmgr dbname=repmgr connect_timeout=2'
    data_directory='${PGDATA}'

    log_level=INFO
    log_facility=STDERR
    log_status_interval=300

    pg_bindir='/usr/lib/postgresql/10/bin'
    use_replication_slots=1

    failover=automatic
    promote_command='repmgr standby promote'
    follow_command='repmgr standby follow -W'
EOF
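The NET_IF line above picks the interface out of netstat's output by fixed character columns, which breaks if the column layout differs between net-tools versions. A minimal sketch of a less layout-dependent alternative, assuming the container exposes the standard Linux /proc/net/route table (where the default route is the entry whose Destination is 00000000); the function name default_iface is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print the interface that carries the default route.
# Reads the Linux routing table (default: /proc/net/route), skips the header
# line, and prints the Iface column of the first entry whose Destination
# column (field 2) is 00000000, i.e. the default route.
default_iface() {
    awk 'NR > 1 && $2 == "00000000" { print $1; exit }' "${1:-/proc/net/route}"
}

# Possible usage in the script above, replacing the netstat/awk pipeline:
#   NET_IF=$(default_iface)
```

This only replaces interface detection; the ifconfig-based NET_IP line would stay as it is.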

I also tried adding these lines:

#   service_start_command='pg_ctl -D ${PGDATA} start'
#   service_stop_command='pg_ctl -D ${PGDATA} stop -m fast'
#   service_reload_command='pg_ctl -D ${PGDATA} reload'
#service_restart_command='pg_ctl -D ${PGDATA} restart -m fast'

but got the same result.

I hope someone can help me with this. Thanks,

ibarwick commented 4 years ago

At this point we haven't made any particular provision for repmgr to run in Docker, so it's possible there may be issues of one kind or another.

> I also tried adding this
>
> # service_start_command='pg_ctl -D ${PGDATA} start'
> # service_stop_command='pg_ctl -D ${PGDATA} stop -m fast'
> # service_reload_command='pg_ctl -D ${PGDATA} reload'
> #service_restart_command='pg_ctl -D ${PGDATA} restart -m fast'
>
> but same result.

Did you try adding these items without the leading #, i.e.:

service_start_command='pg_ctl -D ${PGDATA} start'
service_stop_command='pg_ctl -D ${PGDATA} stop -m fast'
service_reload_command='pg_ctl -D ${PGDATA} reload'
service_restart_command='pg_ctl -D ${PGDATA} restart -m fast'

By default, when restarting a node for a standby follow operation, repmgr will stop and then start the server using pg_ctl, as pg_ctl restart has proven to be problematic in some environments. However, the opposite might be the case here. Either way, we strongly recommend using the OS-level service commands where available to avoid issues like this (though I'm not sure whether those are available here).
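As a sketch of what that looks like in the original post's setup: the heredoc from that script with the four service_*_command lines uncommented (logging settings omitted for brevity). Because the EOF delimiter is unquoted, ${PGDATA} is expanded when the file is written, so repmgr sees the literal data directory path. The write_repmgr_conf function and the CONF_FILE override are hypothetical additions for illustration; pg_ctl is assumed to be on PATH, as it is in the official postgres images.

```shell
#!/bin/bash
# Sketch: the heredoc from the original post with the service commands enabled.
# my_node, NET_IP and PGDATA are expected to be set as in that script.
# CONF_FILE is a hypothetical override; it defaults to /etc/repmgr.conf.
CONF_FILE=${CONF_FILE:-/etc/repmgr.conf}

write_repmgr_conf() {
    # Unquoted EOF: ${my_node}, ${NET_IP} and ${PGDATA} are expanded now,
    # while the file is being written.
    cat <<EOF > "$CONF_FILE"
node_id=${my_node}
node_name=postgres-${my_node}
conninfo='host=${NET_IP} user=repmgr password=repmgr dbname=repmgr connect_timeout=2'
data_directory='${PGDATA}'
pg_bindir='/usr/lib/postgresql/10/bin'
use_replication_slots=1

failover=automatic
promote_command='repmgr standby promote'
follow_command='repmgr standby follow -W'

service_start_command='pg_ctl -D ${PGDATA} start'
service_stop_command='pg_ctl -D ${PGDATA} stop -m fast'
service_reload_command='pg_ctl -D ${PGDATA} reload'
service_restart_command='pg_ctl -D ${PGDATA} restart -m fast'
EOF
}
```

Calling write_repmgr_conf after setting my_node, NET_IP and PGDATA regenerates the file in place.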

ibarwick commented 4 years ago

Also, I see from the Stack Exchange post that you're using repmgr 5.0; we strongly recommend using repmgr 5.1, the latest version.
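To confirm which version a container actually ends up with after a rebuild, a small check can be added to the startup script. A sketch; version_at_least is a hypothetical helper, it relies on GNU sort's -V option, and the usage line assumes `repmgr --version` prints something like "repmgr 5.1.0":

```shell
#!/bin/sh
# Hypothetical helper: succeed if version $1 is at least version $2.
# Version-sorts both strings and checks that the minimum ($2) comes first.
# Relies on GNU sort's -V (version sort), available in Debian-based images.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Possible usage, assuming `repmgr --version` prints e.g. "repmgr 5.1.0":
#   installed=$(repmgr --version | awk '{ print $2 }')
#   version_at_least "$installed" 5.1 || echo "repmgr $installed is older than 5.1" >&2
```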

m-jayson commented 4 years ago

@ibarwick Yes, I tried it without the '#'.

For repmgr, here is how I install it:

RUN echo "deb http://apt.postgresql.org/pub/repos/apt/ stretch-pgdg main 10" \
          >> /etc/apt/sources.list.d/pgdg.list
RUN apt-get update && apt-get install -y postgresql-10-repmgr repmgr-common

Could you please help me? Where can I download it?

m-jayson commented 4 years ago

(For the 5.1 version, I mean.) I assume the repmgr commands would be the same and only the version changes.

Anyway, I found it:

RUN curl https://dl.2ndquadrant.com/default/release/get/deb | bash
RUN apt-get update && apt-get install -y postgresql-10-repmgr repmgr-common
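The repmgr package has to be built for the same PostgreSQL major version the container runs (10 in this thread, going by the image version and pg_bindir). The official postgres images set a PG_MAJOR environment variable, so one way to keep the two in sync is to derive the package name from it. A small sketch; the function name repmgr_package is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: name of the repmgr package matching the server version.
# PG_MAJOR is set by the official postgres Docker images; the :? expansion
# aborts with an error message if it is unset or empty.
repmgr_package() {
    echo "postgresql-${PG_MAJOR:?PG_MAJOR is not set}-repmgr"
}

# In a Dockerfile RUN step the same idea would look something like:
#   apt-get update && apt-get install -y "postgresql-${PG_MAJOR}-repmgr" repmgr-common
```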

I'll try the changes you recommended and get back to you later.

m-jayson commented 4 years ago

@ibarwick It seems the Docker image doesn't include the systemctl command. I also updated to version 5.1, but still no luck.

ibarwick commented 4 years ago

In that case I'm not sure what can be done. As stated before, we haven't tested this on Docker at all, so it's hard to see what the issue might be. If time permits, I'll see if I can reproduce this later in the week, but I can't promise anything.

m-jayson commented 4 years ago

@ibarwick Thanks. How do you start repmgrd, by the way?

This is how I do it:

#!/bin/bash

repmgrd -v

ibarwick commented 4 years ago

Aha, if you start it like that, it's probably not daemonizing properly.

Try something like:

repmgrd -f /etc/repmgr.conf --daemonize --pid-file=/tmp/repmgrd.pid >> /tmp/repmgrd.log 2>&1
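If the startup script should fail fast when repmgrd does not come up, the PID file from the command above can be checked after launching it. A sketch; the helper names, the two-second wait and the removal of an old PID file are assumptions, and the paths are taken from the suggested command:

```shell
#!/bin/bash
# Paths taken from the command suggested above.
PID_FILE=/tmp/repmgrd.pid
LOG_FILE=/tmp/repmgrd.log

# Hypothetical helper: succeed if the PID stored in file $1 is a live process.
pid_alive() {
    local pid
    [ -s "$1" ] || return 1          # missing or empty PID file
    pid=$(cat "$1")
    kill -0 "$pid" 2>/dev/null       # signal 0 only tests that the PID exists
}

# Hypothetical helper: start repmgrd in the background and confirm it is running.
start_repmgrd() {
    # Assumes no other repmgrd instance runs in this container, so any
    # existing PID file is stale and safe to remove.
    rm -f "$PID_FILE"
    repmgrd -f /etc/repmgr.conf --daemonize --pid-file="$PID_FILE" >> "$LOG_FILE" 2>&1
    sleep 2   # give the daemon a moment to write its PID file
    if pid_alive "$PID_FILE"; then
        echo "repmgrd running with PID $(cat "$PID_FILE")"
    else
        echo "repmgrd did not start; check $LOG_FILE" >&2
        return 1
    fi
}
```

Calling start_repmgrd from the entrypoint makes a failed start visible as a non-zero exit status instead of a silently missing daemon.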

m-jayson commented 4 years ago

Oh, thanks. I'll give it a try.

On Tue, Oct 6, 2020, 7:11 PM Ian Barwick notifications@github.com wrote:

m-jayson commented 4 years ago

@ibarwick Do we have to stop the PostgreSQL server whenever we register a node as primary or standby?

m-jayson commented 4 years ago

@ibarwick I think I have fixed it already: (screenshot)

Thanks for your help.

I still have two more things to sort out:

  1. These lines feel dirty to me:

    repmgrd --verbose >> /tmp/repmgrd.log 2>&1
    tail -f /tmp/repmgrd.log

    I have to tail the log because otherwise the Docker container exits right away.

  2. When I take down the first node and then bring it back up, it still says it is the primary. Since a new primary has already been elected, if you have an approach for making it come back as a standby instead, that would be a great help.
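Two hedged sketches for the items above. For item 1, `tail -f` keeps the container alive even after repmgrd has died; an alternative is to block only as long as the process recorded in repmgrd's PID file is alive (this assumes repmgrd was started with --pid-file, as suggested earlier in the thread). For item 2, repmgr provides `repmgr node rejoin` for re-attaching a former primary as a standby; it has to run while PostgreSQL on that node is stopped, and the --force-rewind option relies on pg_rewind, which needs wal_log_hints=on or data checksums. The helper names, the poll interval and the conninfo (copied from the repmgr.conf in this thread) are assumptions; running the rejoin with --dry-run first is worth considering.

```shell
#!/bin/bash
# Item 1 (hypothetical helper): block until the process whose PID is stored
# in file $1 exits, polling every $2 seconds (default 5).
# Returns 1 if the PID file cannot be read.
wait_for_pid_exit() {
    local pid_file=$1 interval=${2:-5} pid
    pid=$(cat "$pid_file") || return 1
    while kill -0 "$pid" 2>/dev/null; do
        sleep "$interval"
    done
}

# Item 2 (hypothetical helper): rejoin this former primary as a standby,
# using the node at host $1 (the new primary) as the connection target.
# PostgreSQL on this node must already be stopped.
rejoin_as_standby() {
    local upstream_host=$1
    repmgr node rejoin -f /etc/repmgr.conf \
        -d "host=${upstream_host} user=repmgr password=repmgr dbname=repmgr connect_timeout=2" \
        --force-rewind --verbose
}

# Possible entrypoint usage (repmgrd started with --pid-file=/tmp/repmgrd.pid):
#   tail -F /tmp/repmgrd.log &          # mirror the log to `docker logs`
#   wait_for_pid_exit /tmp/repmgrd.pid
#   exit 1                              # container stops when repmgrd stops
```

Letting the container exit when repmgrd exits means Docker's restart policy, rather than a tail process that never ends, decides what happens next.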

Thanks,