EnterpriseDB / repmgr

A lightweight replication manager for PostgreSQL (Postgres)
https://repmgr.org/

Unable to start repmgr10 as systemd service. #346

Closed jan2ary closed 5 years ago

jan2ary commented 6 years ago

Hi,

I have two Vagrant machines (CentOS 7.4.1708) with PostgreSQL 10.1 on both, running as master-slave with repmgr (installed as repmgr 3.2 and then upgraded to 4.0.0). This is not a production setup. I'm currently unable to start repmgr as a systemd service; it fails with this error:

Nov 28 12:58:17 pg10b systemd[1]: Starting A replication manager, and failover management tool for PostgreSQL...
-- Subject: Unit repmgr10.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit repmgr10.service has begun starting up.
Nov 28 12:58:17 pg10b repmgrd[2354]: [2017-11-28 12:58:17] [NOTICE] using provided configuration file "/etc/repmgr/10/repmgr.conf"
Nov 28 12:58:17 pg10b repmgrd[2354]: row number 0 is out of range 0..-1
Nov 28 12:58:17 pg10b repmgrd[2354]: [2017-11-28 12:58:17] [ERROR] unable to write to shared memory
Nov 28 12:58:17 pg10b repmgrd[2354]: [2017-11-28 12:58:17] [HINT] ensure "shared_preload_libraries" includes "repmgr"
Nov 28 12:58:17 pg10b systemd[1]: repmgr10.service: control process exited, code=exited status=1
Nov 28 12:58:17 pg10b systemd[1]: Failed to start A replication manager, and failover management tool for PostgreSQL.

PostgreSQL has shared_preload_libraries set to this value:

-bash-4.2$ psql -c "show shared_preload_libraries"
 shared_preload_libraries
--------------------------
 repmgr
(1 row)

SELinux is disabled.

jan2ary commented 6 years ago

This still reproduces with 4.0.1. repmgrd also fails to start from the shell:

-bash-4.2$ repmgrd -v
INFO: checking for package configuration file "/etc/repmgr/10/repmgr.conf"
INFO: configuration file found at: "/etc/repmgr/10/repmgr.conf"
row number 0 is out of range 0..-1
[2017-12-13 13:13:51] [ERROR] unable to write to shared memory
[2017-12-13 13:13:51] [HINT] ensure "shared_preload_libraries" includes "repmgr"
-bash-4.2$ repmgrd -V
repmgrd 4.0.1

ibarwick commented 6 years ago

Odd; this works fine on a fresh installation.

Can you provide a copy of the repmgr.conf file?

jan2ary commented 6 years ago

Config file follows:

node_id=2
node_name=pg10b
data_directory='/var/lib/pgsql/10/data'
pg_ctl_options='-l /var/lib/pgsql/repmgr/switchover.log'
pg_bindir='/usr/pgsql-10/bin'
conninfo='host=pg10b user=repmgr dbname=repmgr connect_timeout=2'
use_replication_slots=1

failover=automatic
promote_command='/usr/pgsql-10/bin/repmgr standby promote --log-to-file'
follow_command='/usr/pgsql-10/bin/repmgr standby follow --log-to-file --upstream-node-id=%n'

ibarwick commented 6 years ago

Hi

Removing this from the 4.0.2 milestone as we're unable to reproduce the issue.

This error:

-bash-4.2$ repmgrd -v
INFO: checking for package configuration file "/etc/repmgr/10/repmgr.conf"
INFO: configuration file found at: "/etc/repmgr/10/repmgr.conf"
row number 0 is out of range 0..-1

could indicate that something is wrong with the way the packaged extension SQL is installed, or that there is some sort of permissions issue.

Please try executing this:

psql -d 'host=pg10b user=repmgr dbname=repmgr connect_timeout=2' -c "SELECT repmgr.get_local_node_id()"

and report the output.

Thanks.

jmehnle commented 6 years ago

I'm having this issue with PostgreSQL 9.3 and repmgr 4.0.1:

Jan 17 01:30:57 db repmgrd[31748]: connecting to database "host=db port=5432 user=repmgr"
Jan 17 01:30:57 db repmgrd[31748]: connecting to: "user=repmgr host=db port=5432 connect_timeout=2 fallback_application_name=repmgr"
Jan 17 01:30:57 db repmgrd[31748]: unable to write to shared memory
Jan 17 01:30:57 db repmgrd[31748]: repmgrd terminating...

For some weird reason repmgr.get_local_node_id() returns null:

$ psql -U repmgr -d repmgr -P 'null=(null)' -Atc 'SELECT repmgr.get_local_node_id()'
(null)

Our repmgr.conf:

# repmgr configuration file
# See <https://github.com/2ndQuadrant/repmgr/blob/master/repmgr.conf.sample>.
###############################################################################

pg_bindir='/usr/lib/postgresql/9.3/bin'
data_directory='/data/postgresql/9.3/main'

# Node ID
node_id=1
node_name=db

# Connection information
conninfo='host=db port=5432 user=repmgr'
rsync_options='--archive --checksum --compress --progress --rsh=ssh'
async_query_timeout=60
reconnect_attempts=6
reconnect_interval=10

# Autofailover options
failover=automatic

# Default: NOTICE
log_level=DEBUG

# Logging facility: possible values are STDERR or - for Syslog integration - one of LOCAL0, LOCAL1, ..., LOCAL7, USER
# Default: STDERR
log_facility=LOCAL7

# commands for repmgrd
promote_command='/etc/repmgr/switch-eip'
follow_command='repmgr standby follow'

jmehnle commented 6 years ago

It turns out that we no longer had repmgr (or, formerly, repmgr_funcs) in the shared_preload_libraries Pg option.

This was an accidental consequence of following the repmgr 4 upgrade instructions. I had jumped directly to https://repmgr.org/docs/4.0/upgrading-from-repmgr-3.html, which is linked from the https://repmgr.org/ front page and which said nothing about any changes needed to the shared_preload_libraries Pg option. After upgrading to repmgr 4, Pg raised errors that it couldn't find the repmgr_funcs.so library, and since I knew repmgr 4 was now a Pg extension, I simply assumed that the setting was no longer required at all.

Only today, when I ascended one level up in the docs to https://repmgr.org/docs/4.0/upgrading-repmgr.html did I see that …

If repmgrd is running, it may be necessary to restart the PostgreSQL server if the upgrade contains changes to the shared object file used by repmgrd; check the release notes for details.

I then followed the link to the 4.0.0 release notes and saw this:

The repmgr shared library has been renamed from repmgr_funcs to repmgr, meaning shared_preload_libraries in postgresql.conf needs to be updated to the new name: shared_preload_libraries = 'repmgr'

Once I updated the shared_preload_libraries option in postgresql.conf and restarted Pg, repmgrd started working again.
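
As a rough way to catch the renamed-library problem described above before restarting PostgreSQL, the sketch below inspects a postgresql.conf file and reports whether shared_preload_libraries lists repmgr, still lists the old repmgr_funcs name, or lists neither. The function name check_preload and its ok/legacy/missing labels are made up for illustration and are not part of repmgr. It also assumes the setting is written with an equals sign in a single file, so it ignores include directives and postgresql.auto.conf. Running psql -c "show shared_preload_libraries" against the live server, as earlier in this thread, remains the authoritative check.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of repmgr: inspect a postgresql.conf file and
# print one of
#   ok       shared_preload_libraries lists "repmgr" (the repmgr 4 name)
#   legacy   it lists only the old "repmgr_funcs" name
#   missing  neither name is listed, or the setting is absent/commented out
check_preload() {
  local conf="$1" value

  # Take the last uncommented assignment (a later line overrides an earlier
  # one), drop everything up to "=", drop a trailing comment, then remove
  # whitespace and quotes so only a comma-separated list remains.
  value=$(grep -E '^[[:space:]]*shared_preload_libraries[[:space:]]*=' "$conf" \
            | tail -n 1 \
            | sed -E 's/^[^=]*=//; s/#.*$//' \
            | tr -d "[:space:]\"'")

  # Wrap the list in commas so each entry can be matched as a whole word.
  case ",$value," in
    *,repmgr,*)       echo "ok" ;;
    *,repmgr_funcs,*) echo "legacy" ;;
    *)                echo "missing" ;;
  esac
}
```

Called as check_preload /path/to/postgresql.conf, it would print legacy for a server still configured for repmgr 3. After editing the file, PostgreSQL still has to be restarted for the change to take effect.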

Assuming that the original reporter of this issue had a similar issue, I see two problems here:

  1. A documentation issue: https://repmgr.org/docs/4.0/upgrading-from-repmgr-3.html is linked to directly from the https://repmgr.org/ frontpage but says nothing about changes required to the shared_preload_libraries Pg option.
  2. A diagnostics issue: when repmgrd checks whether the repmgr shared library has been loaded by Pg, the error message shown is useless to the average user: unable to write to shared memory. Supposedly a relevant hint is logged, but in my case it did not show up in syslog at all; it's not clear why.

jan2ary commented 6 years ago

For me things were even worse:

psql -d 'host=pg10b user=repmgr dbname=repmgr connect_timeout=2' -c "SELECT repmgr.get_local_node_id()"

ERROR:  42883: function repmgr.get_local_node_id() does not exist
LINE 1: SELECT repmgr.get_local_node_id()
               ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
LOCATION:  ParseFuncOrColumn, parse_func.c:528

I dropped the repmgr extension, dropped all its tables (events), and created it again, then re-registered the primary with the --force option. That fixed the issue. To me it looks like a wrong upgrade sequence: either the doc wasn't clear or I wasn't careful enough in following it.
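
The three results reported in this thread for SELECT repmgr.get_local_node_id() each pointed to a different cause, so a small triage helper can be sketched as follows. It assumes psql is called with -At and -P 'null=(null)', as in the earlier comment, and that stderr is captured along with stdout. The function name and the labels it prints are hypothetical, and the mapping reflects only what was reported here, not an exhaustive list of causes.

```shell
#!/usr/bin/env bash
# Hypothetical triage helper, not part of repmgr. Pass it the text printed by
#   psql -U repmgr -d repmgr -P 'null=(null)' -Atc 'SELECT repmgr.get_local_node_id()' 2>&1
# and it prints a label for the situation that output matched in this thread.
classify_node_id_output() {
  local out="$1"
  case "$out" in
    # The SQL function is absent: in this thread, fixed by dropping and
    # recreating the extension and re-registering the primary.
    *"does not exist"*) echo "extension-functions-missing" ;;
    # NULL result: in this thread, "repmgr" was missing from
    # shared_preload_libraries after the upgrade from repmgr 3.
    "(null)"|"")        echo "library-not-preloaded" ;;
    # Anything else that is not a plain integer is left for manual review.
    *[!0-9]*)           echo "unrecognized" ;;
    # A plain integer is the node_id from repmgr.conf.
    *)                  echo "ok" ;;
  esac
}
```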

vldanch commented 5 years ago

I have the same problem with repmgr version 4.3 and PG 9.5. How can I solve it?