dimitri / pgcopydb

Copy a Postgres database to a target Postgres server (pg_dump | pg_restore on steroids)
Other
1.16k stars 76 forks source link

Migration failed to create a replication slot due to concurrent transactions. Failed after 5 hours #819

Closed pranavidandu closed 2 months ago

pranavidandu commented 3 months ago

Migration failed to create snapshot due to concurrent transactions and failed after 5 hours. Can we have a timeout in these scenarios and inform the customer about this error?


2024-06-13 16:13:20.388 80 INFO   main.c:136                Running pgcopydb version 0.16 from "/usr/lib/postgresql/15/bin/pgcopydb"
2024-06-13 16:13:20.392 80 INFO   cli_common.c:1491         [SOURCE] Copying database from "<source-conn-string>"
2024-06-13 16:13:20.392 80 INFO   cli_common.c:1492         [TARGET] Copying database into "<target-conn-string>"
2024-06-13 16:13:20.407 80 INFO   copydb.c:105              Using work dir "xxxxxx"
2024-06-13 16:13:20.555 80 SQL    pgsql.c:502               Connecting to [source] "<source-conn-string>"
2024-06-13 21:37:45.518 80 ERROR  pgsql.c:4041              Failed to send CREATE_REPLICATION_SLOT command

Source server logs:

2024-06-13 16:13:20 UTC-666b1aa0.7a700-LOG: logical decoding found initial starting point at 6F4/32BC3A90
--
2024-06-13 16:13:20 UTC-666b1aa0.7a700-DETAIL: Waiting for transactions (approximately 1) older than 16601567 to end.

Pg stat sessions logs:

PreciseTimeStamp | database_name | application_name | state | wait_event_type | wait_event
-- | -- | -- | -- | -- | --
2024-06-13 16:17:00.5181900 | <dbname> | pgcopydb[79] /usr/lib/postgresql/15/bin/pgcopydb | active | Lock | transactionid
.
.
.
2024-06-13 21:33:38.3123537 | <dbname> | pgcopydb[79] /usr/lib/postgresql/15/bin/pgcopydb | active | Lock | transactionid

Repro link - https://issues.redhat.com/browse/DBZ-862

hanefi commented 3 months ago

Do you suggest we set lock_timeout only for the CREATE_REPLICATION_SLOT command sent over the replication connection? What do you think is a meaningful value?

Can you share the next lines of output from pgcopydb. I believe the next lines of the logs after Failed to send CREATE_REPLICATION_SLOT command: (notice the colon at the end) should contain more information on the error.

marikkan-microsoft commented 3 months ago

@pranavidandu Waiting for the logs as @hanefi mentioned

hanefi commented 2 months ago

Hi @pranavidandu ,

After some more internal discussions with the team we decided to keep the current behavior. Any operator can set lock_timeout setting in Postgres and see the errors sooner than the config value that is 5 hours in this example.

If you read my earlier message, you can see more details on the error message in our logs. PostgreSQL waits for locks, and reports that in their logs as well.