dolthub / doltgresql

DoltgreSQL - Version Controlled PostgreSQL
Apache License 2.0
997 stars 20 forks source link

Unable to set up replication #332

Open deathemperor opened 1 month ago

deathemperor commented 1 month ago

I've tried with 2 types of OS without success, each with its own problem. Both are doltgres 0.7.5

Ubuntu 20.04:

config.yaml:

listener:
  host: 127.0.0.1
  port: 5455
postgres_replication:
  postgres_server_address: 127.0.0.1
  postgres_user: postgres
  postgres_password: 123456
  postgres_database: postgres
  postgres_port: 5433
  slot_name: doltgres_slot_test
behavior:
  dolt_transaction_commit: true
log_level: debug

log of running doltgres --data-dir /vol2/dolgresql/data

Starting replication
INFO[0000] Server ready. Accepting connections.
WARN[0000] secure_file_priv is set to "", which is insecure.
WARN[0000] Any user with GRANT FILE privileges will be able to read any file which the sql-server process can read.
WARN[0000] Please consider restarting the server with secure_file_priv set to a safe (or non-existent) directory.
INFO[0000] NewConnection                                 DisableClientMultiStatements=false connectionID=1
INFO[0000] NewConnection                                 DisableClientMultiStatements=false connectionID=2
INFO[0000] ConnectionClosed                              connectionID=1
DEBU[0000] Starting query                                connectTime="2024-05-31 08:05:51.039836801 +0000 UTC m=+0.077038967" connectionID=2 query="USE `doltgres`;"
DEBU[0000] Query finished in 0 ms                        connectTime="2024-05-31 08:05:51.039836801 +0000 UTC m=+0.077038967" connectionDb=doltgres connectionID=2 query="USE `doltgres`;"
2024/05/31 08:05:51 Starting replicator
2024/05/31 08:05:51 Starting logical replication on slot doltgres_slot_test at WAL location 0/1
2024/05/31 08:05:51 Logical replication started on slot doltgres_slot_test
2024/05/31 08:05:51 Primary Keepalive Message => ServerWALEnd: 0/1 ServerTime: 2024-05-31 08:05:51.047129 +0000 UTC ReplyRequested: false
2024/05/31 08:05:51 Primary Keepalive Message => ServerWALEnd: 4/C62022B8 ServerTime: 2024-05-31 08:05:51.047342 +0000 UTC ReplyRequested: false
2024/05/31 08:06:01 Error: write failed: write tcp 127.0.0.1:43912->127.0.0.1:5433: i/o timeout. Retrying
2024/05/31 08:06:01 Starting logical replication on slot doltgres_slot_test at WAL location 0/1
2024/05/31 08:06:01 Logical replication started on slot doltgres_slot_test
2024/05/31 08:06:01 Sent Standby status message with WALWritePosition = 0/1, WALApplyPosition = 4/C62022B9
2024/05/31 08:06:01 Primary Keepalive Message => ServerWALEnd: 0/1 ServerTime: 2024-05-31 08:06:01.047826 +0000 UTC ReplyRequested: false
2024/05/31 08:06:01 Primary Keepalive Message => ServerWALEnd: 4/C62022B8 ServerTime: 2024-05-31 08:06:01.049908 +0000 UTC ReplyRequested: false
2024/05/31 08:06:11 Error: write failed: write tcp 127.0.0.1:59418->127.0.0.1:5433: i/o timeout. Retrying
2024/05/31 08:06:11 Starting logical replication on slot doltgres_slot_test at WAL location 0/1
2024/05/31 08:06:11 Logical replication started on slot doltgres_slot_test
2024/05/31 08:06:11 Sent Standby status message with WALWritePosition = 0/1, WALApplyPosition = 4/C62022B9
2024/05/31 08:06:11 Primary Keepalive Message => ServerWALEnd: 4/C62022B8 ServerTime: 2024-05-31 08:06:11.057484 +0000 UTC ReplyRequested: false
2024/05/31 08:06:21 Sent Standby status message with WALWritePosition = 0/1, WALApplyPosition = 4/C62022B9
2024/05/31 08:06:21 Error: conn busy. Retrying
2024/05/31 08:06:21 Starting logical replication on slot doltgres_slot_test at WAL location 0/1
2024/05/31 08:06:21 Logical replication started on slot doltgres_slot_test
2024/05/31 08:06:21 Primary Keepalive Message => ServerWALEnd: 0/1 ServerTime: 2024-05-31 08:06:21.066913 +0000 UTC ReplyRequested: false
2024/05/31 08:06:21 Primary Keepalive Message => ServerWALEnd: 4/C62022B8 ServerTime: 2024-05-31 08:06:21.067196 +0000 UTC ReplyRequested: false
2024/05/31 08:06:31 Error: write failed: write tcp 127.0.0.1:39032->127.0.0.1:5433: i/o timeout. Retrying
2024/05/31 08:06:31 Starting logical replication on slot doltgres_slot_test at WAL location 0/1
2024/05/31 08:06:31 Logical replication started on slot doltgres_slot_test
2024/05/31 08:06:31 Sent Standby status message with WALWritePosition = 0/1, WALApplyPosition = 4/C62022B9
2024/05/31 08:06:31 Primary Keepalive Message => ServerWALEnd: 0/1 ServerTime: 2024-05-31 08:06:31.063825 +0000 UTC ReplyRequested: false
2024/05/31 08:06:31 Primary Keepalive Message => ServerWALEnd: 4/C62022B8 ServerTime: 2024-05-31 08:06:31.065772 +0000 UTC ReplyRequested: false
2024/05/31 08:06:41 Sent Standby status message with WALWritePosition = 0/1, WALApplyPosition = 4/C62022B9
2024/05/31 08:06:41 Error: conn busy. Retrying
2024/05/31 08:06:41 Starting logical replication on slot doltgres_slot_test at WAL location 0/1
2024/05/31 08:06:41 Logical replication started on slot doltgres_slot_test
2024/05/31 08:06:41 Primary Keepalive Message => ServerWALEnd: 0/1 ServerTime: 2024-05-31 08:06:41.072082 +0000 UTC ReplyRequested: false
2024/05/31 08:06:41 Primary Keepalive Message => ServerWALEnd: 4/C62022B8 ServerTime: 2024-05-31 08:06:41.072171 +0000 UTC ReplyRequested: false
2024/05/31 08:06:51 Sent Standby status message with WALWritePosition = 0/1, WALApplyPosition = 4/C62022B9
2024/05/31 08:06:51 Primary Keepalive Message => ServerWALEnd: 4/C62022B8 ServerTime: 2024-05-31 08:06:51.065432 +0000 UTC ReplyRequested: false
2024/05/31 08:07:01 Error: write failed: write tcp 127.0.0.1:56734->127.0.0.1:5433: i/o timeout. Retrying
2024/05/31 08:07:01 Starting logical replication on slot doltgres_slot_test at WAL location 0/1
2024/05/31 08:07:01 Logical replication started on slot doltgres_slot_test
2024/05/31 08:07:01 Sent Standby status message with WALWritePosition = 0/1, WALApplyPosition = 4/C62022B9
2024/05/31 08:07:01 Primary Keepalive Message => ServerWALEnd: 4/C62022B8 ServerTime: 2024-05-31 08:07:01.074074 +0000 UTC ReplyRequested: false
2024/05/31 08:07:11 Sent Standby status message with WALWritePosition = 0/1, WALApplyPosition = 4/C62022B9
2024/05/31 08:07:11 Primary Keepalive Message => ServerWALEnd: 4/C62022B8 ServerTime: 2024-05-31 08:07:11.074933 +0000 UTC ReplyRequested: false
2024/05/31 08:07:21 Sent Standby status message with WALWritePosition = 0/1, WALApplyPosition = 4/C62022B9
2024/05/31 08:07:21 Error: conn busy. Retrying
2024/05/31 08:07:21 Starting logical replication on slot doltgres_slot_test at WAL location 0/1
2024/05/31 08:07:21 Logical replication started on slot doltgres_slot_test
2024/05/31 08:07:21 Primary Keepalive Message => ServerWALEnd: 0/1 ServerTime: 2024-05-31 08:07:21.08955 +0000 UTC ReplyRequested: false
2024/05/31 08:07:21 Primary Keepalive Message => ServerWALEnd: 4/C62022B8 ServerTime: 2024-05-31 08:07:21.089704 +0000 UTC ReplyRequested: false
2024/05/31 08:07:31 Error: write failed: write tcp 127.0.0.1:51928->127.0.0.1:5433: i/o timeout. Retrying
2024/05/31 08:07:31 Starting logical replication on slot doltgres_slot_test at WAL location 0/1
2024/05/31 08:07:31 Logical replication started on slot doltgres_slot_test
2024/05/31 08:07:31 Sent Standby status message with WALWritePosition = 0/1, WALApplyPosition = 4/C62022B9
2024/05/31 08:07:31 Primary Keepalive Message => ServerWALEnd: 0/1 ServerTime: 2024-05-31 08:07:31.082044 +0000 UTC ReplyRequested: false
2024/05/31 08:07:31 Primary Keepalive Message => ServerWALEnd: 4/C62022B8 ServerTime: 2024-05-31 08:07:31.082326 +0000 UTC ReplyRequested: false
2024/05/31 08:07:41 Error: write failed: write tcp 127.0.0.1:39222->127.0.0.1:5433: i/o timeout. Retrying
2024/05/31 08:07:41 Starting logical replication on slot doltgres_slot_test at WAL location 0/1
2024/05/31 08:07:41 Logical replication started on slot doltgres_slot_test
2024/05/31 08:07:41 Sent Standby status message with WALWritePosition = 0/1, WALApplyPosition = 4/C62022B9
2024/05/31 08:07:41 Primary Keepalive Message => ServerWALEnd: 0/1 ServerTime: 2024-05-31 08:07:41.088457 +0000 UTC ReplyRequested: false
2024/05/31 08:07:41 Primary Keepalive Message => ServerWALEnd: 4/C62022B8 ServerTime: 2024-05-31 08:07:41.088672 +0000 UTC ReplyRequested: false
2024/05/31 08:07:51 Error during replication: write failed: write tcp 127.0.0.1:58328->127.0.0.1:5433: i/o timeout
2024/05/31 08:07:51 shutting down replicator
DEBU[0120] Received message: Terminate
INFO[0120] ConnectionClosed                              connectionID=2

I followed replication instruction here: https://docs.doltgres.com/guides/replication-from-postgres. Trying to test with employees table

=================================

MacOS M3 Max version 14.5

config.yaml:

listener:
  host: 127.0.0.1
  port: 5456
postgres_replication:
  postgres_server_address: 127.0.0.11
  postgres_user: postgres1
  postgres_password: password1
  postgres_database: postgres1
  postgres_port: 54322
  slot_name: doltgres_slot_test2
behavior:
  dolt_transaction_commit: true

logs of running doltgres -config config.yaml

Starting replication
INFO[0000] Server ready. Accepting connections.
WARN[0000] secure_file_priv is set to "", which is insecure.
WARN[0000] Any user with GRANT FILE privileges will be able to read any file which the sql-server process can read.
WARN[0000] Please consider restarting the server with secure_file_priv set to a safe (or non-existent) directory.

It's supposed to throw error because the postgres_server_address and its info is incorrect, as well as slot name. In Ubuntu it's able to throw connection error if doltgres has trouble connecting. I set the config to incorrect info for troubleshooting after running the cli but no similar logs output found comparing to the ubuntu's.

zachmu commented 1 month ago

Hi @deathemperor,

From the first log sample, it looks like the replica is receiving replication events from Postgres, but then it can't send its response. Is there anything in the postgres log to indicate what's happening? We haven't seen this failure mode in any of our testing. What version of Postgres are you running? And what does the pg_replication_slots table have in it?

https://www.postgresql.org/docs/16/view-pg-replication-slots.html

We haven't tried running our replica tests on Mac yet, only Ubuntu. We'll see if we can reproduce that.

Also happy to help you debug this in real time on our discord, come on by whenever you have a chance.

https://discord.com/invite/RFwfYpu

deathemperor commented 1 month ago

hey @zachmu,

More details as requested:

Postgres log 2024-06-04 01:57:19.326 UTC [3147] STATEMENT: START_REPLICATION SLOT doltgres_slot_test LOGICAL 0/1 (proto_version '2', publication_names 'doltgres_slot_test', messages 'true', streaming 'true') 2024-06-04 01:57:21.869 UTC [3151] LOG: starting logical decoding for slot "doltgres_slot_test" 2024-06-04 01:57:21.869 UTC [3151] DETAIL: Streaming transactions committing after 0/1, reading WAL from 4/C6201FD8. 2024-06-04 01:57:21.869 UTC [3151] STATEMENT: START_REPLICATION SLOT doltgres_slot_test LOGICAL 0/1 (proto_version '2', publication_names 'doltgres_slot_test', messages 'true', streaming 'true') 2024-06-04 01:57:21.869 UTC [3151] LOG: logical decoding found consistent point at 4/C6201FD8 2024-06-04 01:57:21.869 UTC [3151] DETAIL: There are no running transactions. 2024-06-04 01:57:21.869 UTC [3151] STATEMENT: START_REPLICATION SLOT doltgres_slot_test LOGICAL 0/1 (proto_version '2', publication_names 'doltgres_slot_test', messages 'true', streaming 'true') 2024-06-04 01:57:31.868 UTC [3153] LOG: starting logical decoding for slot "doltgres_slot_test" 2024-06-04 01:57:31.868 UTC [3153] DETAIL: Streaming transactions committing after 0/1, reading WAL from 4/C6201FD8. 2024-06-04 01:57:31.868 UTC [3153] STATEMENT: START_REPLICATION SLOT doltgres_slot_test LOGICAL 0/1 (proto_version '2', publication_names 'doltgres_slot_test', messages 'true', streaming 'true') 2024-06-04 01:57:31.868 UTC [3153] LOG: logical decoding found consistent point at 4/C6201FD8 2024-06-04 01:57:31.868 UTC [3153] DETAIL: There are no running transactions. 2024-06-04 01:57:31.868 UTC [3153] STATEMENT: START_REPLICATION SLOT doltgres_slot_test LOGICAL 0/1 (proto_version '2', publication_names 'doltgres_slot_test', messages 'true', streaming 'true') 2024-06-04 01:57:41.876 UTC [3156] LOG: starting logical decoding for slot "doltgres_slot_test" 2024-06-04 01:57:41.876 UTC [3156] DETAIL: Streaming transactions committing after 0/1, reading WAL from 4/C6201FD8. 2024-06-04 01:57:41.876 UTC [3156] STATEMENT: START_REPLICATION SLOT doltgres_slot_test LOGICAL 0/1 (proto_version '2', publication_names 'doltgres_slot_test', messages 'true', streaming 'true') 2024-06-04 01:57:41.876 UTC [3156] LOG: logical decoding found consistent point at 4/C6201FD8 2024-06-04 01:57:41.876 UTC [3156] DETAIL: There are no running transactions. 2024-06-04 01:57:41.876 UTC [3156] STATEMENT: START_REPLICATION SLOT doltgres_slot_test LOGICAL 0/1 (proto_version '2', publication_names 'doltgres_slot_test', messages 'true', streaming 'true') 2024-06-04 01:57:51.884 UTC [3165] LOG: starting logical decoding for slot "doltgres_slot_test" 2024-06-04 01:57:51.884 UTC [3165] DETAIL: Streaming transactions committing after 0/1, reading WAL from 4/C6201FD8. 2024-06-04 01:57:51.884 UTC [3165] STATEMENT: START_REPLICATION SLOT doltgres_slot_test LOGICAL 0/1 (proto_version '2', publication_names 'doltgres_slot_test', messages 'true', streaming 'true') 2024-06-04 01:57:51.884 UTC [3165] LOG: logical decoding found consistent point at 4/C6201FD8 2024-06-04 01:57:51.884 UTC [3165] DETAIL: There are no running transactions. 2024-06-04 01:57:51.884 UTC [3165] STATEMENT: START_REPLICATION SLOT doltgres_slot_test LOGICAL 0/1 (proto_version '2', publication_names 'doltgres_slot_test', messages 'true', streaming 'true') 2024-06-04 01:57:56.899 UTC [3167] FATAL: role "root" does not exist 2024-06-04 01:58:03.732 UTC [3171] FATAL: role "root" does not exist 2024-06-04 01:58:11.892 UTC [3175] LOG: starting logical decoding for slot "doltgres_slot_test" 2024-06-04 01:58:11.892 UTC [3175] DETAIL: Streaming transactions committing after 0/1, reading WAL from 4/C6201FD8. 2024-06-04 01:58:11.892 UTC [3175] STATEMENT: START_REPLICATION SLOT doltgres_slot_test LOGICAL 0/1 (proto_version '2', publication_names 'doltgres_slot_test', messages 'true', streaming 'true') 2024-06-04 01:58:11.892 UTC [3175] LOG: logical decoding found consistent point at 4/C6201FD8 2024-06-04 01:58:11.892 UTC [3175] DETAIL: There are no running transactions. 2024-06-04 01:58:11.892 UTC [3175] STATEMENT: START_REPLICATION SLOT doltgres_slot_test LOGICAL 0/1 (proto_version '2', publication_names 'doltgres_slot_test', messages 'true', streaming 'true') 2024-06-04 01:58:31.900 UTC [3179] LOG: starting logical decoding for slot "doltgres_slot_test" 2024-06-04 01:58:31.900 UTC [3179] DETAIL: Streaming transactions committing after 0/1, reading WAL from 4/C6201FD8. 2024-06-04 01:58:31.900 UTC [3179] STATEMENT: START_REPLICATION SLOT doltgres_slot_test LOGICAL 0/1 (proto_version '2', publication_names 'doltgres_slot_test', messages 'true', streaming 'true') 2024-06-04 01:58:31.900 UTC [3179] LOG: logical decoding found consistent point at 4/C6201FD8 2024-06-04 01:58:31.900 UTC [3179] DETAIL: There are no running transactions. 2024-06-04 01:58:31.900 UTC [3179] STATEMENT: START_REPLICATION SLOT doltgres_slot_test LOGICAL 0/1 (proto_version '2', publication_names 'doltgres_slot_test', messages 'true', streaming 'true') 2024-06-04 01:58:41.907 UTC [3182] LOG: starting logical decoding for slot "doltgres_slot_test" 2024-06-04 01:58:41.907 UTC [3182] DETAIL: Streaming transactions committing after 0/1, reading WAL from 4/C6201FD8. 2024-06-04 01:58:41.907 UTC [3182] STATEMENT: START_REPLICATION SLOT doltgres_slot_test LOGICAL 0/1 (proto_version '2', publication_names 'doltgres_slot_test', messages 'true', streaming 'true') 2024-06-04 01:58:41.907 UTC [3182] LOG: logical decoding found consistent point at 4/C6201FD8 2024-06-04 01:58:41.907 UTC [3182] DETAIL: There are no running transactions. 2024-06-04 01:58:41.907 UTC [3182] STATEMENT: START_REPLICATION SLOT doltgres_slot_test LOGICAL 0/1 (proto_version '2', publication_names 'doltgres_slot_test', messages 'true', streaming 'true') 2024-06-04 01:58:51.913 UTC [3184] LOG: starting logical decoding for slot "doltgres_slot_test" 2024-06-04 01:58:51.913 UTC [3184] DETAIL: Streaming transactions committing after 0/1, reading WAL from 4/C6201FD8. 2024-06-04 01:58:51.913 UTC [3184] STATEMENT: START_REPLICATION SLOT doltgres_slot_test LOGICAL 0/1 (proto_version '2', publication_names 'doltgres_slot_test', messages 'true', streaming 'true') 2024-06-04 01:58:51.913 UTC [3184] LOG: logical decoding found consistent point at 4/C6201FD8 2024-06-04 01:58:51.913 UTC [3184] DETAIL: There are no running transactions. 2024-06-04 01:58:51.913 UTC [3184] STATEMENT: START_REPLICATION SLOT doltgres_slot_test LOGICAL 0/1 (proto_version '2', publication_names 'doltgres_slot_test', messages 'true', streaming 'true') 2024-06-04 01:59:01.920 UTC [3187] LOG: starting logical decoding for slot "doltgres_slot_test" 2024-06-04 01:59:01.920 UTC [3187] DETAIL: Streaming transactions committing after 0/1, reading WAL from 4/C6201FD8. 2024-06-04 01:59:01.920 UTC [3187] STATEMENT: START_REPLICATION SLOT doltgres_slot_test LOGICAL 0/1 (proto_version '2', publication_names 'doltgres_slot_test', messages 'true', streaming 'true') 2024-06-04 01:59:01.921 UTC [3187] LOG: logical decoding found consistent point at 4/C6201FD8 2024-06-04 01:59:01.921 UTC [3187] DETAIL: There are no running transactions. 2024-06-04 01:59:01.921 UTC [3187] STATEMENT: START_REPLICATION SLOT doltgres_slot_test LOGICAL 0/1 (proto_version '2', publication_names 'doltgres_slot_test', messages 'true', streaming 'true') 2024-06-04 01:59:11.927 UTC [3189] LOG: starting logical decoding for slot "doltgres_slot_test" 2024-06-04 01:59:11.927 UTC [3189] DETAIL: Streaming transactions committing after 0/1, reading WAL from 4/C6201FD8. 2024-06-04 01:59:11.927 UTC [3189] STATEMENT: START_REPLICATION SLOT doltgres_slot_test LOGICAL 0/1 (proto_version '2', publication_names 'doltgres_slot_test', messages 'true', streaming 'true') 2024-06-04 01:59:11.927 UTC [3189] LOG: logical decoding found consistent point at 4/C6201FD8 2024-06-04 01:59:11.927 UTC [3189] DETAIL: There are no running transactions. 2024-06-04 01:59:11.927 UTC [3189] STATEMENT: START_REPLICATION SLOT doltgres_slot_test LOGICAL 0/1 (proto_version '2', publication_names 'doltgres_slot_test', messages 'true', streaming 'true')

Postgres version: PostgreSQL 15.4 (Debian 15.4-2.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit

result of pg_replication_slots

postgres=# SELECT * FROM pg_stat_replication;
 pid  | usesysid | usename  | application_name | client_addr | client_hostname | client_port |         backend_start         | backend_xmin |   state   |  sent_lsn  | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state |          reply_time
------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+-----------+-----------+------------+-----------+-----------+------------+---------------+------------+-------------------------------
 3179 |       10 | postgres |                  | 172.23.0.1  |                 |       56616 | 2024-06-04 01:58:31.897029+00 |              | streaming | 4/C6202450 | 0/1       | 0/1       | 4/C6202451 |           |           |            |             0 | async      | 2024-06-04 01:58:31.900897+00
zachmu commented 1 month ago

Nothing here looks super suspicious to me. I think what we'll try here is making sending the standby messages more resilient, so replication doesn't die or restart when it can't send the standby messages. We'll have that out for you to try in a couple days.