Closed: patrickmglynn closed this issue 2 months ago.
Can you please test the same scenario with pg_squeeze 1.4 (actually 1.4.1 is the latest release)? It should consume less disk space (see https://github.com/cybertec-postgresql/pg_squeeze/issues/51#issue-1127110005 for details), so the processing might also be a bit faster. If the replication still gets stuck due to the logical worker's exit, I suggest increasing the wal_receiver_timeout configuration variable (on the subscriber's side).
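A minimal sketch of that change on the subscriber, assuming superuser access and that a configuration reload is sufficient for this parameter (the 10min value is only an illustration):

-- On the subscriber: give the apply worker more time before it exits
-- while the large squeeze transaction is being decoded and applied.
ALTER SYSTEM SET wal_receiver_timeout = '10min';
SELECT pg_reload_conf();
SHOW wal_receiver_timeout;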
Thanks for the quick reply.
I will work on building the 1.4.1 release; in the meantime, is there an .rpm available for this? I couldn't find anything in the PostgreSQL yum repo.
I thought the point of issue https://github.com/cybertec-postgresql/pg_squeeze/issues/47#issue-910280352 was to add pg_squeeze 1.4 to the official repository, but it really appears not to be there. I've submitted a request to set up the build.
pg_squeeze 1.4 is now in the PGDG repository (yum). It can only be installed on PG 14, though; you need to install it from source if you use a lower PG version.
I was able to install 1.4.1 from the official PG12 repository. The subscriber has wal_receiver_timeout set to 5min; unfortunately, replication lag remains fixed at the size of the table being squeezed once the squeeze process completes.
I will re-test in PG14 once we are able to move to this version.
patrickmglynn, did you solve the problem? I've just hit exactly the same issue.
Stale, re-open if needed.
Hello,
I have observed that running pg_squeeze on large tables, either manually or on a schedule, leads to replication lag building constantly and requires intervention to prevent continuous WAL growth.
Our environment is PostgreSQL 12.1 running on Red Hat 7.9, with pg_squeeze version 1.3:
One publisher replicates data to one subscriber.
max_replication_slots and max_wal_senders are both set to 10 on the publisher, max_wal_senders is 20 on the subscriber, and we currently have 6 subscriptions operating in total (one per schema to replicate).

Prior to running squeeze, the replication slots on the publisher are all active and the subscription state is streaming:
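(Output not reproduced here.) The checks are against standard catalog views, roughly along these lines; the column sets assume PG 12:

-- On the publisher: slot activity and how far each slot has been confirmed.
SELECT slot_name, active, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots;

-- On the publisher: walsender state per subscription (expected to be 'streaming').
SELECT application_name, state, sent_lsn, replay_lsn
FROM pg_stat_replication;

-- On the subscriber: apply worker progress per subscription.
SELECT subname, received_lsn, latest_end_lsn
FROM pg_stat_subscription;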
If I attempt to run the command below on a large table within the sub1 publication (the table occupies 83GB on disk, 68GB of which is bloat):
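The exact command is not shown here; a manual squeeze of a single table in pg_squeeze 1.3/1.4 looks roughly like this (schema and table name are placeholders, and the function signature differs in later pg_squeeze versions):

-- Placeholder names; clustering index and tablespace arguments left NULL.
SELECT squeeze.squeeze_table('public', 'big_table', null, null, null);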
The operation finishes successfully (reducing the size of the table dramatically), but replication lags during the squeeze process and then never recovers:
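The actual figures are not reproduced here; the lag can be quantified on the publisher with a query along these lines (the slot name is a placeholder):

-- Bytes of WAL the subscription's slot still has to consume.
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag
FROM pg_replication_slots
WHERE slot_name = 'sub1';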
From the above we can see that the slots are still active on the publisher, but the subscription is now stuck in the catchup state (also, the restart_lsn and confirmed_flush_lsn do not change), and the size of the pg_wal directory has now grown to 11GB and continues to grow over time.

Checking the subscriber logs shows the replication workers timing out as the squeeze process runs:
wal_sender_timeout is 5min on the publisher and 1min on the subscriber side.

At this point the only way to get the subscriptions working again is to DROP and CREATE them again, thus losing the built-up WAL files. ALTER SUBSCRIPTION DISABLE/ENABLE has no effect, and neither does restarting the postgresql-12 service on the subscriber side.

Is there a misconfiguration or setting I have overlooked here?
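For reference, the disable/enable cycle mentioned above is of this form (sub1 is a placeholder subscription name):

-- Cycling the subscription; in this case it did not clear the lag.
ALTER SUBSCRIPTION sub1 DISABLE;
ALTER SUBSCRIPTION sub1 ENABLE;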
Please share any suggestions on how I can incorporate squeeze into our environment alongside our logical replication setup.