2ndQuadrant / pglogical

Logical Replication extension for PostgreSQL 17, 16, 15, 14, 13, 12, 11, 10, 9.6, 9.5, 9.4 (Postgres), providing much faster replication than Slony, Bucardo or Londiste, as well as cross-version upgrades.
http://2ndquadrant.com/en/resources/pglogical/
Other
1.01k stars 153 forks source link

failed to increase restart lsn error #342

Open carolguo-dd opened 3 years ago

carolguo-dd commented 3 years ago

Hi, I have a table with 21million rows and it fails to replicate. It took about 20mins for the initial sync up process, then when it fails to replicate with below error from subscriber side

021-09-02 02:28:19 UTC::@:[12358]:DEBUG: registering background worker "pglogical apply 16396:1806104921"
2021-09-02 02:28:19 UTC::@:[12358]:DEBUG: starting background worker process "pglogical apply 16396:1806104921"
2021-09-02 02:28:19 UTC::@:[12358]:DEBUG: unregistering background worker "pglogical apply 16396:1806104921"
2021-09-02 02:28:19 UTC::@:[12358]:LOG: background worker "pglogical apply 16396:1806104921" (PID 21463) exited with exit code 1

The error from provider side is

2021-09-02 02:34:19.298 UTC [26328]  LOG:  logical decoding found consistent point at 2001/AE09AC58
2021-09-02 02:34:19.298 UTC [26328]  DETAIL:  Logical decoding will begin using saved snapshot.
2021-09-02 02:34:19.303 UTC [26328]  LOG:  could not receive data from client: Connection reset by peer
2021-09-02 02:34:19.303 UTC [26328]  LOG:  unexpected EOF on standby connection

I tested replicating another table with similar row count, it has been successful.

Any idea what could be wrong or how to get more information for troubleshoot?

petere commented 3 years ago

Please provide the versions of pglogical and PostgreSQL.

pdomagala commented 3 years ago

same issue, on two RDS instances:

publisher: PostgreSQL 12.7 and pglogical v2.3.2 subscriber: PostgreSQL 13.3 and pglogical v2.3.3

carolguo-dd commented 3 years ago

publisher: postgres 9.6.15, pglogical 2.3.4 subscribe: rds postgres 12.7, pglogical 2.3.2 It seems very odd that it only fails on this specific table, while successes on all other tables.

carolguo-dd commented 3 years ago

turned on debug2 on provider side and get a bit more error message from provider:

2021-09-07 14:20:14.474 UTC [19265] DETAIL:  Logical decoding will begin using saved snapshot.
2021-09-07 14:20:14.475 UTC [19265] DEBUG:  failed to increase restart lsn: proposed 2020/9E067D10, after 2020/9E067D10, current candidate 2020/9E067D10, current after 2020/9E067D10, flushed up to 2020/9E0640D8
2021-09-07 14:20:14.477 UTC [19265] DEBUG:  failed to increase restart lsn: proposed 2020/9E0EBCA0, after 2020/9E0EBCA0, current candidate 2020/9E067D10, current after 2020/9E067D10, flushed up to 2020/9E0640D8
2021-09-07 14:20:14.477 UTC [19265] LOG:  could not receive data from client: Connection reset by peer
2021-09-07 14:20:14.477 UTC [19265] LOG:  unexpected EOF on standby connection
2021-09-07 14:23:14.589 UTC [21242]DEBUG:  received replication command: IDENTIFY_SYSTEM
2021-09-07 14:23:14.590 UTC [21242] DEBUG:  received replication command: START_REPLICATION SLOT "pgl_dogdatastaging_host_pshedb9e07_subscription_vs2" LOGICAL 2020/9E060100 (expected_encoding 'UTF8', min_proto_version '1', max_proto_version '1', startup_params_format '1', "binary.want_internal_basetypes" '1', "binary.want_binary_basetypes" '1', "binary.basetypes_major_version" '1200', "binary.sizeof_datum" '8', "binary.sizeof_int" '4', "binary.sizeof_long" '8', "binary.bigendian" '0', "binary.float4_byval" '1', "binary.float8_byval" '1', "binary.integer_datetimes" '0', "hooks.setup_function" 'pglogical.pglogical_hooks_setup', "pglogical.forward_origins" '"all"', "pglogical.replication_set_names" 'vs_2', "relmeta_cache_size" '-1', pg_version '120007', pglogical_version '2.3.2', pglogical_version_num '20302', pglogical_apply_pid '21913')
2021-09-07 14:23:14.590 UTC [21242] DEBUG:  cannot stream from 2020/9E060100, minimum is 2020/9E0640D8, forwarding
2021-09-07 14:23:14.591 UTC [21242] DEBUG:  Unrecognised pglogical parameter startup_params_format ignored
2021-09-07 14:23:14.591 UTC [21242] CONTEXT:  slot "pgl_dogdatastaging_host_pshedb9e07_subscription_vs2", output plugin "pglogical_output", in the startup callback
2021-09-07 14:23:14.591 UTC [21242] DEBUG:  Unrecognised pglogical parameter relmeta_cache_size ignored
2021-09-07 14:23:14.591 UTC [21242] CONTEXT:  slot "pgl_dogdatastaging_host_pshedb9e07_subscription_vs2", output plugin "pglogical_output", in the startup callback
2021-09-07 14:23:14.591 UTC [21242] DEBUG:  Unrecognised pglogical parameter pglogical_version ignored
2021-09-07 14:23:14.591 UTC [21242] CONTEXT:  slot "pgl_dogdatastaging_host_pshedb9e07_subscription_vs2", output plugin "pglogical_output", in the startup callback
2021-09-07 14:23:14.591 UTC [21242] DEBUG:  Unrecognised pglogical parameter pglogical_version_num ignored
carolguo-dd commented 3 years ago

I think the main problem may be this

cannot stream from 2020/9E060100, minimum is 2020/9E0640D8, forwarding

But I have checked with select pg_xlogfile_name('2020/9E0640D8'); select pg_xlogfile_name('2020/9E060100'); they point to the same wal log file which is right on the pg_xlog directory

pdomagala commented 3 years ago

guys any update on this? it's a critical issue for us.

mmarcinkowski commented 2 years ago

@carolguo-dd did you find a solution? I have exactly the same issue.

jaskinn commented 2 years ago

Same issue on RDS

publisher: PostgreSQL 9.6.22 and pglogical v2.2.2 subscriber: PostgreSQL 13.4 and pglogical v2.4.0

yannh commented 2 years ago

@carolguo-dd - I have a similar issue. Could you confirm if while the initial sync was happening, the database was experiencing significant amount of writes? I am wondering if pglogical might do a COPY up to the LSN from when the subscription was created - and then once that is done (potentially quite some time later in your case), switch to WAL replication from the primary, for which it would need the primary to have all logs since the subscription was started. TLDR: I am wondering if a solution might be to increase WAL retention on the primary :thinking:

martinmarques commented 2 years ago

pglogical creates a slot when the subscription is created, and before copying any data, so all needed WALs will be available on the primary during catchup time after the initial COPY is over