Logical Replication extension for PostgreSQL 17, 16, 15, 14, 13, 12, 11, 10, 9.6, 9.5, 9.4 (Postgres), providing much faster replication than Slony, Bucardo or Londiste, as well as cross-version upgrades.
Hi! I am writing in the hopes of getting some advice on how I can better debug the behaviour I'm encountering. We're using pglogical to replicate a production database into a separate instance we can use for reporting/analytics (all in GCP). Around once a day, sometimes far more frequently, we get the following logs in our replica instance:
2023-07-12 14:11:47.223 UTC [3036301]: [1-1] db=analytics_prod,user=[unknown] LOG: starting apply for subscription prod_subscription
2023-07-12 14:11:47.317 UTC [3036301]: [2-1] db=analytics_prod,user=[unknown] ERROR: incorrect binary data format
2023-07-12 14:11:47.317 UTC [3036301]: [3-1] db=analytics_prod,user=[unknown] CONTEXT: apply UPDATE in commit before 3472/6727B900, xid 418939469 committed at 2023-07-12 13:59:30.061784+00 (action #2) from node replorigin 1
2023-07-12 14:11:47.320 UTC [3036301]: [4-1] db=analytics_prod,user=[unknown] LOG: apply worker [3036301] at slot 1 generation 3752 exiting with error
2023-07-12 14:11:47.323 UTC [7]: [16482-1] db=,user= LOG: background worker \"pglogical apply 25026:3244215688\" (PID 3036301) exited with exit code 1
When this happens, replication goes down. I am able to work around it by reviewing/removing the topmost information in the slot on the publisher/production database side of things:
Then, I can re-enable the subscription on the replica side and it seems to be "fixed" although I'm sure some data was lost by my manual changes to the replication slot.
I've attempted to reproduce this on my local laptop a few times by restoring a production dump and setting up replication there, but my attempts are still unpredictable as to whether they will trigger the error in question.
My current working theory is that our organization's relatively recent usage of uuid as a primary key in our tables (we're slowly moving towards this and away from bigint PKs) could be causing this -- but this theory is only based on intuition, not proof just yet.
Is it possible that using uuid columns in our tables could cause this? Is there some form of supplementary logging I can turn up to determine the exact cause of this error? Is there some other/better means of debugging you would suggest?
Looking for any advice I can get that might help me get closer to solving this. Thanks in advance! :)
Hi! I am writing in the hopes of getting some advice on how I can better debug the behaviour I'm encountering. We're using
pglogical
to replicate a production database into a separate instance we can use for reporting/analytics (all in GCP). Around once a day, sometimes far more frequently, we get the following logs in our replica instance:When this happens, replication goes down. I am able to work around it by reviewing/removing the topmost information in the slot on the publisher/production database side of things:
Then, I can re-enable the subscription on the replica side and it seems to be "fixed" although I'm sure some data was lost by my manual changes to the replication slot.
I've attempted to reproduce this on my local laptop a few times by restoring a production dump and setting up replication there, but my attempts are still unpredictable as to whether they will trigger the error in question.
My current working theory is that our organization's relatively recent usage of
uuid
as a primary key in our tables (we're slowly moving towards this and away from bigint PKs) could be causing this -- but this theory is only based on intuition, not proof just yet.Is it possible that using
uuid
columns in our tables could cause this? Is there some form of supplementary logging I can turn up to determine the exact cause of this error? Is there some other/better means of debugging you would suggest?Looking for any advice I can get that might help me get closer to solving this. Thanks in advance! :)