electric-sql / electric

Sync little subsets of your Postgres data into local apps and services.
https://electric-sql.com
Apache License 2.0

Electric pg connection closing on large syncs #519

Closed: js2702 closed this issue 1 month ago

js2702 commented 11 months ago

We are running tests with large quantities of data (10,000-15,000 new rows) on a table with a foreign key relationship (so compensation messages are sent). We've found that the Electric service sometimes complains about the Postgres connection being closed. We've been progressively increasing the row count: at around 10K the sync may or may not fail, and when it does fail, a retry sometimes succeeds and the data syncs correctly. But as we keep increasing the number of rows, it starts failing consistently.

The tests use Electric server and client 0.6.4. We ran them on a macOS machine (in Docker) and on a Linux server, and the failure happens on both.
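For illustration, the failing workload can be reproduced with a bulk insert along these lines. This is a minimal sketch, not our actual import code: the parents/children schema, connection string, and column names are all placeholders.

# Hypothetical reproduction script -- table and column names are placeholders.
# Inserts N child rows referencing one parent row, so Electric has to emit
# compensation messages for the foreign key.
import psycopg2

N_ROWS = 15_000  # around 10K is where syncs start to fail for us

conn = psycopg2.connect("postgresql://postgres:password@localhost:5432/app")
with conn, conn.cursor() as cur:
    cur.execute(
        "INSERT INTO parents (id, name) VALUES (gen_random_uuid(), 'load-test') RETURNING id"
    )
    parent_id = cur.fetchone()[0]
    # One big transaction: this arrives at Electric as a single large
    # batch of logical-replication messages.
    cur.executemany(
        "INSERT INTO children (id, parent_id, payload) VALUES (gen_random_uuid(), %s, %s)",
        [(parent_id, f"row-{i}") for i in range(N_ROWS)],
    )
conn.close()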

Electric logs

manabox_sync-electric-1  | 08:35:29.239 pid=<0.2824.0> origin=postgres_1 pg_slot=postgres_1 [debug] Sending 60010 messages to the subscriber: from #Lsn<0/75AA9> to #Lsn<0/108291>
manabox_sync-electric-1  | 08:36:30.160 pid=<0.2824.0> origin=postgres_1 pg_slot=postgres_1 [error] GenServer #PID<0.2824.0> terminating
manabox_sync-electric-1  | ** (MatchError) no match of right hand side value: {:error, :closed}
manabox_sync-electric-1  |     (electric 0.6.4) lib/electric/replication/postgres/tcp_server.ex:620: Electric.Replication.Postgres.TcpServer.tcp_send/2
manabox_sync-electric-1  |     (elixir 1.15.4) lib/enum.ex:984: Enum."-each/2-lists^foreach/1-0-"/2
manabox_sync-electric-1  |     (electric 0.6.4) lib/electric/replication/postgres/slot_server.ex:321: Electric.Replication.Postgres.SlotServer.send_transaction/3
manabox_sync-electric-1  |     (elixir 1.15.4) lib/enum.ex:2510: Enum."-reduce/3-lists^foldl/2-0-"/3
manabox_sync-electric-1  |     (electric 0.6.4) lib/electric/replication/postgres/slot_server.ex:275: Electric.Replication.Postgres.SlotServer.handle_events/3
manabox_sync-electric-1  |     (gen_stage 1.2.1) lib/gen_stage.ex:2578: GenStage.consumer_dispatch/6
manabox_sync-electric-1  |     (stdlib 4.3.1.2) gen_server.erl:1123: :gen_server.try_dispatch/4
manabox_sync-electric-1  |     (stdlib 4.3.1.2) gen_server.erl:1200: :gen_server.handle_msg/6
... // Last Message
manabox_sync-electric-1  | 08:36:30.176 pid=<0.2905.0> origin=postgres_1 pg_slot=postgres_1 [debug] slot server started, registered as {:n, :l, {Electric.Replication.Postgres.SlotServer, "postgres_1"}} and {:n, :l, {Electric.Replication.Postgres.SlotServer, {:slot_name, "postgres_1"}}}

Postgres logs

manabox_sync-postgres-1  | 2023-10-03 08:37:04.899 GMT [184] ERROR:  could not receive data from WAL stream: server closed the connection unexpectedly
manabox_sync-postgres-1  |              This probably means the server terminated abnormally
manabox_sync-postgres-1  |              before or while processing the request.
manabox_sync-postgres-1  | 2023-10-03 08:37:04.901 GMT [1] LOG:  background worker "logical replication worker" (PID 184) exited with exit code 1
manabox_sync-postgres-1  | 2023-10-03 08:37:04.902 GMT [298] LOG:  logical replication apply worker for subscription "postgres_1" has started

Extra

Somewhat on topic: would there be any difference, in terms of server performance, between one user syncing 10K oplogs and 1K users syncing 10 oplogs each? If you know of any tool we could use to test with a higher number of users, we'd love to hear about it.

alco commented 11 months ago

Hey @js2702. Thanks a lot for sharing your findings!

We have a load-testing/perf-analysis project on our roadmap, but we haven't quite got there yet. Dealing with large amounts of data is definitely something that can be improved by using bulk operations and a more compact subprotocol for data transfer, both between the client and the server and between Electric and PG.

> Somewhat on topic: would there be any difference, in terms of server performance, between one user syncing 10K oplogs and 1K users syncing 10 oplogs each?

In theory, there shouldn't be a difference: Electric fans in all incoming client writes into a single stream that is then fed into PG via logical replication.
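As a toy illustration of that fan-in (plain Python rather than Electric's actual Elixir implementation, with made-up names): whether the writes come from one client or a thousand, they end up in the same single consumer.

# Toy fan-in sketch -- illustrative only, not Electric's code.
import asyncio

async def client_writer(queue: asyncio.Queue, client_id: int, n_ops: int):
    # Each connected client pushes its oplog entries into the shared queue.
    for i in range(n_ops):
        await queue.put((client_id, i))

async def replication_consumer(queue: asyncio.Queue, total_ops: int):
    # A single consumer drains the queue in order; this models the one
    # logical-replication stream into Postgres.
    for _ in range(total_ops):
        client_id, op = await queue.get()

async def main():
    queue: asyncio.Queue = asyncio.Queue()
    # 1K users x 10 oplogs each converges to the same stream as 1 user x 10K.
    writers = [client_writer(queue, c, 10) for c in range(1000)]
    await asyncio.gather(replication_consumer(queue, 10_000), *writers)

asyncio.run(main())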

> If you know of any tool we could use to test with a higher number of users, we'd love to hear about it.

Could you share some details about the toolset you're currently using to run those tests?

js2702 commented 11 months ago

Right now we are using a script that reuses part of our application to mass-import CSV files.
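For reference, a minimal sketch of that kind of mass import, assuming psycopg2 and the same hypothetical children table as above (our real script goes through the application layer rather than raw SQL):

# Hypothetical CSV mass-import sketch -- file layout and table are placeholders.
import psycopg2

conn = psycopg2.connect("postgresql://postgres:password@localhost:5432/app")
with conn, conn.cursor() as cur, open("children.csv") as f:
    # COPY streams the whole file in one transaction, which is what produces
    # the large single-transaction syncs described above.
    cur.copy_expert("COPY children (id, parent_id, payload) FROM STDIN WITH CSV HEADER", f)
conn.close()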

To check performance and network bandwidth we are using cAdvisor and Prometheus for the metrics.

version: "3.8"
name: docker_metrics

services:
  # cAdvisor exposes per-container resource and network metrics
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.2
    privileged: true
    devices:
      - "/dev/kmsg"
    ports:
      - 8080:8080
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro

  # Prometheus scrapes cAdvisor and stores the time series
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - 9090:9090
    command:
      - --config.file=/etc/prometheus/prometheus.yml
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    depends_on:
      - cadvisor
And the prometheus.yml config

scrape_configs:
  - job_name: cadvisor
    scrape_interval: 5s
    static_configs:
      - targets:
          - cadvisor:8080

We measure outgoing bytes from the Electric container and incoming bytes into the Postgres container, then subtract one from the other to get an approximate figure for what a hosting provider like GCP would charge for egress.

Prometheus queries:

increase(container_network_receive_bytes_total{name="postgres-1"}[30s])
increase(container_network_transmit_bytes_total{name="electric-1"}[30s])
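The same estimate can also be pulled programmatically from the Prometheus HTTP API. A small sketch, assuming the container names and the 9090 port from the setup above:

# Rough egress estimate via the Prometheus HTTP API.
import requests

PROM = "http://localhost:9090/api/v1/query"

def instant(query: str) -> float:
    # Instant query; returns 0.0 if the series has no samples yet.
    result = requests.get(PROM, params={"query": query}).json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

electric_out = instant('increase(container_network_transmit_bytes_total{name="electric-1"}[30s])')
postgres_in = instant('increase(container_network_receive_bytes_total{name="postgres-1"}[30s])')

# Bytes leaving Electric minus bytes going into Postgres approximates
# client-facing traffic, i.e. what a provider like GCP would bill as egress.
print(f"approx egress over 30s: {electric_out - postgres_in:.0f} bytes")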

alco commented 11 months ago

@js2702 Thank you for those details!

KyleAMathews commented 1 month ago

👋 We've spent the last month working on a rebuild of the Electric server over at a temporary repo: https://github.com/electric-sql/electric-next/

You can read more about why we made the decision at https://next.electric-sql.com/about

We're really excited about all the new possibilities the new server brings and we hope you'll check it out soon and give us your feedback.

We're now moving the temporary repo back here. As part of that migration we're closing all the old issues and PRs. We really appreciate you taking the time to investigate and report this issue!