Open ruoshan opened 5 months ago
I was able to get around this issue by creating the node, node_interface, subscription, and local_sync_status resources manually instead of using the pglogical_create_subscriber function. I then created the replication origin, and the replication slot on the remote node. After that I can start replication with the alter_subscription_enable function, and it reuses the existing replication slot.
Is there any other step I am missing, or any downside to this approach, for pglogical to work properly? I have not noticed anything unusual so far.
I think putting the pglogical.create_subscription() call and the update that sets sync_status = 'r' in the same transaction should be enough to fix this.
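To illustrate why a single transaction would help, here is a minimal sketch of the visibility argument only. It uses Python's sqlite3 module purely as a stand-in for PostgreSQL, and the one-column table is a simplified, hypothetical version of pglogical.local_sync_status. The status value 'i' for INIT is an assumption; only 'r' appears in this thread. The reader connection plays the role of the apply worker.

```python
import os
import sqlite3
import tempfile

# Stand-in demo: sqlite3 replaces PostgreSQL, and the one-column table below
# is a simplified, hypothetical version of pglogical.local_sync_status.


def observe(path, single_transaction):
    """Return the statuses another connection sees midway and at the end."""
    writer = sqlite3.connect(path, isolation_level=None)  # explicit BEGIN/COMMIT
    reader = sqlite3.connect(path, isolation_level=None)  # plays the apply worker
    writer.execute("DROP TABLE IF EXISTS local_sync_status")
    writer.execute("CREATE TABLE local_sync_status (sync_status TEXT)")

    def visible():
        rows = reader.execute(
            "SELECT sync_status FROM local_sync_status"
        ).fetchall()
        return [row[0] for row in rows]

    writer.execute("BEGIN")
    # Stands in for create_subscription(), assumed to leave the status at 'i'.
    writer.execute("INSERT INTO local_sync_status VALUES ('i')")
    if not single_transaction:
        writer.execute("COMMIT")  # two separate commits: 'i' is visible here
        writer.execute("BEGIN")
    midway = visible()
    # Stands in for the follow-up statement that sets sync_status = 'r'.
    writer.execute("UPDATE local_sync_status SET sync_status = 'r'")
    writer.execute("COMMIT")
    final = visible()
    writer.close()
    reader.close()
    return midway, final


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.db")
    separate = observe(path, single_transaction=False)
    combined = observe(path, single_transaction=True)

print(separate)  # (['i'], ['r'])
print(combined)  # ([], ['r'])
```

With two commits the reader can observe 'i'; with one transaction it sees either nothing or 'r'. Whether this is sufficient in pglogical itself depends on when the apply worker is launched, which this sketch does not model.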
When using pglogical_create_subscriber to create a new logical subscriber, there is a race condition between two processes that update and read the sync_status field of pglogical.local_sync_status. The two processes are the pglogical_create_subscriber process, which runs SELECT pglogical.create_subscription, and the "pglogical apply worker".

There is a tiny time window in which the "pglogical apply worker" can start its work using the wrong state. The window is at the referenced lines in the pglogical_subscribe function of pglogical_create_subscriber.c. If the first query in those lines has executed and the kernel then switches out the pglogical_create_subscriber process for too long, the "pglogical apply worker" runs the pglogical_sync_subscription function with sync_status == SYNC_STATUS_INIT. Starting pglogical_sync_subscription with INIT status is not OK, because it re-creates the replication slot and uses the new snapshot's LSN as the logical replication origin.

I created a small patch that makes the bug very easy to reproduce on a system that has data coming in while the logical replication is being created. Here is the patch:
(the above patch is under MIT license)
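
For readers who want to reason about the interleaving without a running cluster, the race described in this report can be sketched as a small deterministic toy model. Everything here is hypothetical: the Subscriber class and create_subscriber function are illustrative names, not pglogical APIs, and the status code 'i' for INIT is an assumption.

```python
# Toy model only: the class and function names are hypothetical,
# not pglogical APIs. 'i' for INIT is an assumption; 'r' is from the thread.

SYNC_STATUS_INIT = "i"
SYNC_STATUS_READY = "r"


class Subscriber:
    """Minimal stand-in for the subscriber's committed state."""

    def __init__(self):
        self.sync_status = None      # no local_sync_status row yet
        self.slot_recreated = False  # True if the worker re-created the slot

    def apply_worker_tick(self):
        """The apply worker acts on whatever status is committed right now."""
        if self.sync_status == SYNC_STATUS_INIT:
            # Models pglogical_sync_subscription starting from INIT:
            # the slot is re-created and the origin moves to a new LSN.
            self.slot_recreated = True
            self.sync_status = SYNC_STATUS_READY


def create_subscriber(db, worker_runs_in_window):
    # Query 1 commits on its own and leaves the status at INIT.
    db.sync_status = SYNC_STATUS_INIT
    if worker_runs_in_window:
        db.apply_worker_tick()       # the worker wins the race
    # Query 2 commits separately and marks the subscription ready.
    db.sync_status = SYNC_STATUS_READY
    db.apply_worker_tick()
    return db


lucky = create_subscriber(Subscriber(), worker_runs_in_window=False)
unlucky = create_subscriber(Subscriber(), worker_runs_in_window=True)
print(lucky.slot_recreated, unlucky.slot_recreated)  # False True
```

The only difference between the two runs is whether the worker gets scheduled between the two commits, which is the window the report describes.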