2ndQuadrant / pglogical

Logical Replication extension for PostgreSQL 17, 16, 15, 14, 13, 12, 11, 10, 9.6, 9.5, 9.4 (Postgres), providing much faster replication than Slony, Bucardo or Londiste, as well as cross-version upgrades.
http://2ndquadrant.com/en/resources/pglogical/
Other
1.01k stars 153 forks source link

Subscriber throws connection to the other side died error #311

Open Viralshah009 opened 3 years ago

Viralshah009 commented 3 years ago

Hello, We currently have a provider subscriber setup on PostgreSQL 10.12 using pglogical 2.2.1. Off lately we are seeing the following error on the subscriber end:

< 2021-04-10 09:55:05.147 EDT pid=[249372-2], db=subscriber1, user=[unknown], app=pglogical apply 16401:3097708289, rhost=, tid=7/0 > ERROR: connection to other side has died < 2021-04-10 09:55:06.851 EDT pid=[249377-2], db=subscriber1, user=[unknown], app=pglogical apply 16401:3982767052, rhost=, tid=9/0 > ERROR: connection to other side has died < 2021-04-10 09:55:29.165 EDT pid=[249374-2], db=subscriber1, user=[unknown], app=pglogical apply 16401:706969817, rhost=, tid=13/0 > ERROR: connection to other side has died < 2021-04-10 09:55:33.872 EDT pid=[249376-2], db=subscriber1, user=[unknown], app=pglogical apply 16401:896296907, rhost=, tid=6/0 > ERROR: connection to other side has died < 2021-04-10 09:55:36.949 EDT pid=[249542-2], db=subscriber1, user=[unknown], app=pglogical apply 16401:1763399739, rhost=, tid=10/0 > ERROR: connection to other side has died

We have not been able to narrow down why this is happening. The spamming currently stops when we recreate the subscriptions. Can anyone help on understanding why this might be happening?

Thank you!

eulerto commented 3 years ago

It seems the connection between subscriber and provider has been closed. Did you test the connection subscriber -> provider when observing this issue? Are there related messages before the first ERROR message? Did you check the provider logs around that timestamp? Did it happen again after you recreated the subscriptions?

Viralshah009 commented 3 years ago

Hello Eulerto, there is no issue with the connection between the subscriber and provider. I am sharing my providers log here: < 2021-04-18 21:53:50.346 EDT pid=[41694-1], db=[unknown], user=[unknown], app=[unknown], rhost=10.198.168.0, tid= > LOG: connection received: host=10.198.168.0 port=58036 < 2021-04-18 21:53:50.354 EDT pid=[41694-2], db=prod, user=postgres, app=[unknown], rhost=10.198.168.0, tid=93/47646152 > LOG: replication connection authorized: user=postgres SSL enabled (protocol=TLSv1.2, cipher=ECDHE-ECDSA-AES256-GCM-SHA384, compression=off) < 2021-04-18 21:53:53.438 EDT pid=[41694-3], db=prod, user=postgres, app=pglogical_subscription_1, rhost=10.198.168.0, tid=93/0 > LOG: starting logical decoding for slot "pglogical_slot_1" < 2021-04-18 21:53:53.438 EDT pid=[41694-4], db=prod, user=postgres, app=pglogical_subscription_1, rhost=10.198.168.0, tid=93/0 > DETAIL: streaming transactions committing after 27A1/84A94FE8, reading WAL from 2799/D1005760 < 2021-04-18 21:53:53.461 EDT pid=[41694-5], db=prod, user=postgres, app=pglogical_subscription_1, rhost=10.198.168.0, tid=93/0 > LOG: logical decoding found consistent point at 2799/D1005760 < 2021-04-18 21:53:53.461 EDT pid=[41694-6], db=prod, user=postgres, app=pglogical_subscription_1, rhost=10.198.168.0, tid=93/0 > DETAIL: Logical decoding will begin using saved snapshot. < 2021-04-18 22:50:14.043 EDT pid=[41694-7], db=prod, user=postgres, app=pglogical_subscription_1, rhost=10.198.168.0, tid=93/47646155 > LOG: terminating walsender process due to replication timeout < 2021-04-18 22:50:14.043 EDT pid=[41694-8], db=prod, user=postgres, app=pglogical_subscription_1, rhost=10.198.168.0, tid=93/47646155 > CONTEXT: slot "pglogical_slot_1", output plugin "pglogical_output", in the commit callback, associated LSN 27A1/85551258 < 2021-04-18 22:50:14.044 EDT pid=[41694-9], db=prod, user=postgres, app=pglogical_subscription_1, rhost=10.198.168.0, tid= > LOG: disconnection: session time: 0:56:23.697 user=postgres database=prod host=10.198.168.0 port=58036

We have also recreated the subscriptions to stop this error. However, We see them again after our scheduled weekly db vacuum. It usually generates lots of WALs that needs to be send over. This wasn't a problem for a long time though.

eulerto commented 3 years ago

LOG: terminating walsender process due to replication timeout

You are hitting wal_sender_timeout. There is probably a long running transaction that has been decoded. Increase its value could probably solve this issue.

I strongly recommend that you upgrade Postgres to the latest minor version (current 10.16) since there are some fixes related to walsender in 10.14. You should also be using the latest pglogical version.