Open deathemperor opened 6 months ago
Hi @deathemperor,
From the first log sample, it looks like the replica is receiving replication events from Postgres, but then it can't send its response. Is there anything in the postgres log to indicate what's happening? We haven't seen this failure mode in any of our testing. What version of Postgres are you running? And what does the pg_replication_slots table have in it?
https://www.postgresql.org/docs/16/view-pg-replication-slots.html
We haven't tried running our replica tests on Mac yet, only Ubuntu. We'll see if we can reproduce that.
We're also happy to help you debug this in real time on our Discord. Come on by whenever you have a chance.
hey @zachmu,
More details as requested:
Postgres version:
PostgreSQL 15.4 (Debian 15.4-2.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
result of pg_replication_slots
postgres=# SELECT * FROM pg_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state | reply_time
------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+-----------+-----------+------------+-----------+-----------+------------+---------------+------------+-------------------------------
3179 | 10 | postgres | | 172.23.0.1 | | 56616 | 2024-06-04 01:58:31.897029+00 | | streaming | 4/C6202450 | 0/1 | 0/1 | 4/C6202451 | | | | 0 | async | 2024-06-04 01:58:31.900897+00
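For reading the LSN columns in the row above: Postgres prints an LSN as two hexadecimal numbers separated by a slash, the high and low 32 bits of a 64-bit byte position in the WAL. Below is a minimal Python sketch for comparing two of them. The helper names parse_lsn and lsn_gap are made up for illustration and are not part of Postgres or doltgres.

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN such as '4/C6202450' to a 64-bit integer byte position."""
    high, low = text.strip().split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_gap(ahead: str, behind: str) -> int:
    """Number of WAL bytes between two LSNs (negative if 'ahead' is behind)."""
    return parse_lsn(ahead) - parse_lsn(behind)


if __name__ == "__main__":
    # Values copied from the pg_stat_replication row above.
    print(lsn_gap("4/C6202451", "4/C6202450"))  # replay_lsn vs sent_lsn
    print(lsn_gap("4/C6202450", "0/1"))         # sent_lsn vs flush_lsn
```

This only makes the numbers easier to compare. It does not by itself say whether the gap between sent_lsn and write_lsn/flush_lsn is a problem.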
Nothing here looks super suspicious to me. I think what we'll try here is making the sending of standby messages more resilient, so replication doesn't die or restart when a standby message can't be sent. We'll have that out for you to try in a couple of days.
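The idea of not letting a failed standby message kill replication could look roughly like the sketch below. This is a hypothetical illustration in Python, not the doltgres implementation. StandbyReporter, send_status and max_failures are invented names, and the failure threshold is an assumption.

```python
import logging
from typing import Callable

log = logging.getLogger("replication")


class StandbyReporter:
    """Sends standby status updates without aborting replication on failure."""

    def __init__(self, send_status: Callable[[int], None], max_failures: int = 5):
        self._send_status = send_status    # hypothetical: writes one status message
        self._max_failures = max_failures  # assumed threshold, not a real setting
        self.consecutive_failures = 0

    def report(self, flushed_lsn: int) -> bool:
        """Try to send one update. Return False on failure instead of raising."""
        try:
            self._send_status(flushed_lsn)
        except OSError as err:  # socket timeouts are OSError subclasses
            self.consecutive_failures += 1
            log.warning("standby update failed (%d in a row): %s",
                        self.consecutive_failures, err)
            return False
        self.consecutive_failures = 0
        return True

    def should_reconnect(self) -> bool:
        """Only give up on the connection after repeated failures."""
        return self.consecutive_failures >= self._max_failures
```

The replication loop would keep applying incoming events after a failed report() and only tear down the connection once should_reconnect() returns True.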
@zachmu any update?
I was able to set it up and get it running on my machine, but the same error still occurs on my development server.
Specifically, the error "write failed: write tcp 127.0.0.1:34974->127.0.0.1:5436: i/o timeout. Retrying" is the same as in the previous message. 5436 is the port of the main Postgres instance I'm trying to replicate from.
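Since the timeout is on a TCP write to 127.0.0.1:5436, one quick thing to rule out is whether that port accepts connections at all from the development server. Below is a minimal sketch using only the Python standard library. can_connect is a made-up helper name, and a successful connect only shows the port is reachable, not that replication writes will succeed.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


if __name__ == "__main__":
    # Host and port taken from the error message above.
    print(can_connect("127.0.0.1", 5436))
```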
config.yaml
listener:
  host: 127.0.0.1
  port: 5455
postgres_replication:
  postgres_server_address: localhost
  postgres_user: postgres
  postgres_password: 123456
  postgres_database: postgres
  postgres_port: 5436
  slot_name: doltgres_slot_test
behavior:
  dolt_transaction_commit: true
  read_only: false
log_level: debug
I've tried on two operating systems without success, each with its own problem. Both run doltgres 0.7.5.
Ubuntu 20.04:
config.yaml:
log of running doltgres --data-dir /vol2/dolgresql/data
I followed the replication instructions here: https://docs.doltgres.com/guides/replication-from-postgres. Trying to test with the employees table.
=================================
MacOS M3 Max version 14.5
config.yaml:
logs of running doltgres -config config.yaml
It's supposed to throw an error because postgres_server_address and its related settings are incorrect, as is the slot name. On Ubuntu, doltgres does throw a connection error when it has trouble connecting. I deliberately set the config to incorrect values for troubleshooting, but after running the CLI I found no log output similar to what Ubuntu produces.