EnterpriseDB / repmgr

A lightweight replication manager for PostgreSQL (Postgres)
https://repmgr.org/

"repmgr node check" fails when other (non-repmgr) slots are created #562

Closed gclough closed 5 years ago

gclough commented 5 years ago

We don't use replication slots for repmgr, but we do use them for Barman and for some application purposes. The problem arises because repmgr node check returns a non-zero return code whenever any slot is not actively in use.

ENHANCEMENT REQUEST: Occasionally our application will stop using its slot, and I'm happy with that, but I need some way to tell repmgr to ignore it. Could repmgr, either by default or via a configuration option, monitor ONLY the slots that concern repmgr?

Versions

postgres@test_server_01[test_cluster_01:5432] ~/$ psql --version
psql (PostgreSQL) 10.7

postgres@test_server_01[test_cluster_01:5432] ~/$ repmgr --version
repmgr 4.2

Node check with just the Barman replication slot, returns OK:

postgres@test_server_01[test_cluster_01:5432] ~/$ repmgr node check
Node "test_server_01.uat.salesportal.int":
        Server role: OK (node is primary)
        Replication lag: OK (N/A - node is primary)
        WAL archiving: OK (0 pending archive ready files)
        Downstream servers: OK (this node has no downstream nodes)
        Replication slots: OK (1 of 1 replication slots are active)
        Missing replication slots: OK (node has no missing replication slots)

postgres@test_server_01[test_cluster_01:5432] ~/$ echo $?
0

Add a replication slot for the application, and node check now fails:

postgres@test_server_01[test_cluster_01:5432] ~/$ psql -c "SELECT pg_create_logical_replication_slot('application_slot','wal2json');"
 pg_create_logical_replication_slot
------------------------------------
 (application_slot,80/E00000D0)
(1 row)

postgres@test_server_01[test_cluster_01:5432] ~/$ repmgr node check
Node "test_server_01.uat.salesportal.int":
        Server role: OK (node is primary)
        Replication lag: OK (N/A - node is primary)
        WAL archiving: OK (0 pending archive ready files)
        Downstream servers: OK (this node has no downstream nodes)
        Replication slots: CRITICAL (1 of 2 replication slots are inactive)
        Missing replication slots: OK (node has no missing replication slots)

postgres@test_server_01[test_cluster_01:5432] ~/$ echo $?
25

Drop the slot, and it works again:

postgres@test_server_01[test_cluster_01:5432] ~/$ psql -c "SELECT pg_drop_replication_slot('application_slot');"
 pg_drop_replication_slot
--------------------------

(1 row)

postgres@test_server_01[test_cluster_01:5432] ~/$ repmgr node check
Node "test_server_01.uat.salesportal.int":
        Server role: OK (node is primary)
        Replication lag: OK (N/A - node is primary)
        WAL archiving: OK (0 pending archive ready files)
        Downstream servers: OK (this node has no downstream nodes)
        Replication slots: OK (1 of 1 replication slots are active)
        Missing replication slots: OK (node has no missing replication slots)

postgres@test_server_01[test_cluster_01:5432] ~/$ echo $?
0

Add the slot back, and the explicit --slots check fails as well:

postgres@test_server_01[test_cluster_01:5432] ~/$ psql -c "SELECT pg_create_logical_replication_slot('application_slot','wal2json');"
 pg_create_logical_replication_slot
------------------------------------
 (application_slot,80/E0000108)
(1 row)

postgres@test_server_01[test_cluster_01:5432] ~/$ repmgr node check --slots
CRITICAL (1 of 2 replication slots are inactive)

postgres@test_server_01[test_cluster_01:5432] ~/$ echo $?
2

Drop the slot, and it returns OK:

postgres@test_server_01[test_cluster_01:5432] ~/$ psql -c "SELECT pg_drop_replication_slot('application_slot');"
 pg_drop_replication_slot
--------------------------

(1 row)

postgres@test_server_01[test_cluster_01:5432] ~/$ repmgr node check --slots
OK (1 of 1 replication slots are active)

postgres@test_server_01[test_cluster_01:5432] ~/$ echo $?
0
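
The transcripts above show two different exit codes for the same condition: 25 from the full repmgr node check and 2 from repmgr node check --slots. A monitoring wrapper around the --slots form might translate the status as in the sketch below. Only 0 (OK) and 2 (CRITICAL) are confirmed by the output above; treating 1 as WARNING and anything else as UNKNOWN is an assumption borrowed from the Nagios plugin convention, and slot_status_label is a hypothetical helper name.

```shell
#!/bin/sh
# Translate the exit status of `repmgr node check --slots` into a label.
# 0 and 2 are the values seen in the transcripts above; 1 and the
# catch-all are ASSUMED from the Nagios plugin convention.
slot_status_label() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# Example on a configured repmgr node:
#   repmgr node check --slots; slot_status_label $?
```

The full node check's exit code (25 above) is deliberately not mapped here, since it is not a per-check status.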

ibarwick commented 5 years ago

As of version 4.3 (just released), repmgr only concerns itself with physical replication slots, and ignores logical ones:

$ repmgr --version
repmgr 4.3
# psql -c 'SELECT * from pg_replication_slots'
    slot_name     |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn 
------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
 application_slot | wal2json | logical   |  16385 | repmgr   | f         | f      |            |      |          564 | 0/18CCDA0   | 0/18CCDD8
(1 row)
$ repmgr node check --slots
OK (node has no physical replication slots)
$ echo $?
0

Any inactive physical replication slot is always a concern, no matter which application it was created by.
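
To see which slots fall under that rule on a given server, the inactive physical slots can be listed straight from pg_replication_slots. The following is a minimal sketch and only approximates the check; it is not repmgr's own query. The helper name inactive_physical_slots is hypothetical, and it assumes psql's unaligned, tuples-only output (-At) with the default | separator.

```shell
#!/bin/sh
# Read `slot_name|slot_type|active` rows (as printed by `psql -At`) on stdin
# and print the names of physical slots that are not active, one per line.
# Logical slots are skipped, mirroring the 4.3 behaviour described above.
inactive_physical_slots() {
    awk -F'|' '$2 == "physical" && $3 == "f" { print $1 }'
}

# Example against a live server:
#   psql -At -c "SELECT slot_name, slot_type, active FROM pg_replication_slots" \
#     | inactive_physical_slots
```

With only the logical application_slot present, as in the output above, this prints nothing.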

Does this resolve your issue?

gclough commented 5 years ago

Silly me, I should have used the latest version. Humble apologies. :-/

This does indeed fix the issue. Whilst I agree that any idle replication slot is very bad, IMHO repmgr should report a "CRITICAL" error only for the slots that it creates, or at a minimum only report errors when use_replication_slots=yes is configured. I'll agree that this may be a fringe opinion, so I'll close the issue. Many thanks.
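
The gating suggested in that last comment is not something the thread says repmgr does itself, but a site-local wrapper could approximate it. The sketch below is one way to do so under stated assumptions: the function names are hypothetical, the configuration file is parsed as simple key=value lines, and yes, true, on and 1 (optionally quoted) are assumed to be the spellings that enable the setting.

```shell
#!/bin/sh
# Return 0 if the given repmgr.conf enables use_replication_slots.
# ASSUMPTIONS: key=value lines, "#" starts a comment, and the enabling
# values are yes/true/on/1 (optionally quoted). The last setting wins.
uses_replication_slots() {
    awk -F'=' -v q="'" '
        {
            sub(/#.*/, "")                 # drop comments
            key = $1; val = $2
            gsub(/[ \t]/, "", key)
            gsub(/[ \t"]/, "", val)        # strip blanks and double quotes
            gsub(q, "", val)               # strip single quotes
            if (key == "use_replication_slots") last = tolower(val)
        }
        END { exit !(last == "yes" || last == "true" || last == "on" || last == "1") }
    ' "$1"
}

# Run the slots check only when the node is configured to use slots.
check_slots_if_configured() {
    if uses_replication_slots "$1"; then
        repmgr -f "$1" node check --slots
    else
        echo "SKIPPED (use_replication_slots is not enabled in $1)"
    fi
}

# Example (the path is a placeholder):
#   check_slots_if_configured /etc/repmgr.conf
```

Note that this hides inactive physical slots created by other tools, which is exactly the case the maintainer argues should always be reported.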