2ndQuadrant / pglogical

Logical Replication extension for PostgreSQL 15, 14, 13, 12, 11, 10, 9.6, 9.5, 9.4 (Postgres), providing much faster replication than Slony, Bucardo or Londiste, as well as cross-version upgrades.
http://2ndquadrant.com/en/resources/pglogical/
Other
983 stars 153 forks source link

Issue with Resynchronizing data after connection error recovery #463

Closed bnol closed 5 months ago

bnol commented 5 months ago

Hello,

I recently encountered an issue with my pglogical setup where replication between two PostgreSQL servers was interrupted due to a connection error. After resolving the connection issue, I used pglogical.alter_subscription_synchronize() to restart the replication process. I am reaching out to ensure that this method resynchronizes the data that was not replicated during the downtime, in addition to the new data changes, and to seek advice on monitoring and verifying the integrity of the resynchronization process.

Thanks a lot!

bnol commented 5 months ago

After some researchs, AFAIK, pglogical will pick up and synchronize if data that was not replicated during the downtime is less than wal_keep_size and archive_mode=on.

Do I understand correctly?

bonesmoses commented 5 months ago

AFAIK, pglogical will pick up and synchronize if data that was not replicated during the downtime is less than wal_keep_size and archive_mode=on.

Where did you get this information? Pglogical requires the use of replication slots, and as such, will simply begin streaming logical changes from the last known slot position. If you dropped the slot, you cannot reestablish the subscription, and must recreate it from scratch. No setting of wal_keep_size or archive_mode has anything to do with this.

bnol commented 5 months ago

AFAIK, pglogical will pick up and synchronize if data that was not replicated during the downtime is less than wal_keep_size and archive_mode=on.

Where did you get this information? Pglogical requires the use of replication slots, and as such, will simply begin streaming logical changes from the last known slot position. If you dropped the slot, you cannot reestablish the subscription, and must recreate it from scratch. No setting of wal_keep_size or archive_mode has anything to do with this.

Got it from Gemini... :) thank you for the clarification, I'm pretty new to pglogical. Just out of curiosity, simply begin streaming logical changes from the last known slot position -> in case the last known slot position is too old, there should be a size limitation for the amount of data that kept in WAL, no? If yes, I think this should relate to wal_keep_size param as the postgresql documentation states

wal_keep_size (integer) Specifies the minimum size of past WAL files kept in the pg_wal directory, in case a standby server needs to fetch them for streaming replication. If a standby server connected to the sending server falls behind by more than wal_keep_size megabytes, the sending server might remove a WAL segment still needed by the standby, in which case the replication connection will be terminated. Downstream connections will also eventually fail as a result. (However, the standby server can recover by fetching the segment from archive, if WAL archiving is in use.)

https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-WAL-KEEP-SIZE

In my case, the data has been synchronized again after a while (2 hours).

bonesmoses commented 5 months ago

That's not how replication slots work. Replication slots cause Postgres to retain WAL permanently until the slot position moves. In fact, this caused a lot of problems with people running out of disk space due to a replica being offline or cut off due to a network outage. It's the entire reason the max_slot_wal_keep_size parameter was added in v13.

The wal_keep_size parameter was created long before slots existed, and it was a kind of ugly hack because it was possible for a replica to fall so far behind that even high values wouldn't be enough. Currently, everyone recommends using a backup tool like Barman or pgBackRest to archive WAL long term, and if a replica falls behind and isn't using a slot, it can pull from the archive rather than having the Primary node keep WAL files for eternity.

bonesmoses commented 5 months ago

In any case, I'm closing this issue as unrelated to pglogical. Philosophical or design discussions related to Postgres core can be had elsewhere.