codership / galera

Synchronous multi-master replication library
GNU General Public License v2.0

Use both IPv4 and IPv6 between nodes #657

Open anthonyryan1 opened 4 months ago

anthonyryan1 commented 4 months ago

This is a bit of an observation / suggestion rather than a bug.

I have a Galera WAN cluster that provides high availability across three distinct data centers / availability zones. One of the data centers lost IPv6 connectivity to the other two, but still had functioning IPv4 connectivity.

I noticed that Galera kept repeatedly attempting to connect to the IPv6 address until it timed out and gave up.

2024-04-19  5:38:53 0 [Note] WSREP: (19a9c908-9f93, 'ssl://[::]:4567') connection to peer 00000000-0000 with addr ssl://[beef:beef:beef:beef::1]:4567 timed out, no messages seen in PT3S, socket stats: rtt: 0 rttvar: 250000 rto: 2000000 lost: 1 last_data_recv: 2327670644 cwnd: 1 last_queued_since: 6622946029112785 last_delivered_since: 6622946029112785 send_queue_length: 0 send_queue_bytes: 0
2024-04-19  5:38:53 0 [Note] WSREP: (19a9c908-9f93, 'ssl://[::]:4567') connection to peer 00000000-0000 with addr ssl://[cafe:cafe:cafe:cafe::1]:4567 timed out, no messages seen in PT3S, socket stats: rtt: 0 rttvar: 250000 rto: 2000000 lost: 1 last_data_recv: 2327670644 cwnd: 1 last_queued_since: 6622946029243485 last_delivered_since: 6622946029243485 send_queue_length: 0 send_queue_bytes: 0
2024-04-19  5:38:57 0 [Note] WSREP: (19a9c908-9f93, 'ssl://[::]:4567') connection to peer 00000000-0000 with addr ssl://[beef:beef:beef:beef::1]:4567 timed out, no messages seen in PT3S, socket stats: rtt: 0 rttvar: 250000 rto: 2000000 lost: 1 last_data_recv: 2327674644 cwnd: 1 last_queued_since: 6622950029589363 last_delivered_since: 6622950029589363 send_queue_length: 0 send_queue_bytes: 0
2024-04-19  5:39:00 0 [Note] WSREP: (19a9c908-9f93, 'ssl://[::]:4567') connection to peer 00000000-0000 with addr ssl://[cafe:cafe:cafe:cafe::1]:4567 timed out, no messages seen in PT3S, socket stats: rtt: 0 rttvar: 250000 rto: 2000000 lost: 1 last_data_recv: 2327677644 cwnd: 1 last_queued_since: 6622953029963392 last_delivered_since: 6622953029963392 send_queue_length: 0 send_queue_bytes: 0
2024-04-19  5:39:03 0 [Note] WSREP: (19a9c908-9f93, 'ssl://[::]:4567') connection to peer 00000000-0000 with addr ssl://[beef:beef:beef:beef::1]:4567 timed out, no messages seen in PT3S, socket stats: rtt: 0 rttvar: 250000 rto: 4000000 lost: 1 last_data_recv: 2327681154 cwnd: 1 last_queued_since: 6622956530264078 last_delivered_since: 6622956530264078 send_queue_length: 0 send_queue_bytes: 0
2024-04-19  5:39:06 0 [Note] WSREP: (19a9c908-9f93, 'ssl://[::]:4567') connection to peer 00000000-0000 with addr ssl://[cafe:cafe:cafe:cafe::1]:4567 timed out, no messages seen in PT3S, socket stats: rtt: 0 rttvar: 250000 rto: 2000000 lost: 1 last_data_recv: 2327684154 cwnd: 1 last_queued_since: 6622959530569978 last_delivered_since: 6622959530569978 send_queue_length: 0 send_queue_bytes: 0
2024-04-19  5:39:09 0 [Note] WSREP: (19a9c908-9f93, 'ssl://[::]:4567') connection to peer 00000000-0000 with addr ssl://[beef:beef:beef:beef::1]:4567 timed out, no messages seen in PT3S, socket stats: rtt: 0 rttvar: 250000 rto: 2000000 lost: 1 last_data_recv: 2327687154 cwnd: 1 last_queued_since: 6622962530838950 last_delivered_since: 6622962530838950 send_queue_length: 0 send_queue_bytes: 0
2024-04-19  5:39:13 0 [Note] WSREP: (19a9c908-9f93, 'ssl://[::]:4567') connection to peer 00000000-0000 with addr ssl://[cafe:cafe:cafe:cafe::1]:4567 timed out, no messages seen in PT3S, socket stats: rtt: 0 rttvar: 250000 rto: 4000000 lost: 1 last_data_recv: 2327690654 cwnd: 1 last_queued_since: 6622966031075340 last_delivered_since: 6622966031075340 send_queue_length: 0 send_queue_bytes: 0
2024-04-19  5:39:16 0 [Note] WSREP: (19a9c908-9f93, 'ssl://[::]:4567') connection to peer 00000000-0000 with addr ssl://[beef:beef:beef:beef::1]:4567 timed out, no messages seen in PT3S, socket stats: rtt: 0 rttvar: 250000 rto: 2000000 lost: 1 last_data_recv: 2327693654 cwnd: 1 last_queued_since: 6622969031345969 last_delivered_since: 6622969031345969 send_queue_length: 0 send_queue_bytes: 0
2024-04-19  5:39:19 0 [Note] WSREP: (19a9c908-9f93, 'ssl://[::]:4567') connection to peer 00000000-0000 with addr ssl://[cafe:cafe:cafe:cafe::1]:4567 timed out, no messages seen in PT3S, socket stats: rtt: 0 rttvar: 250000 rto: 4000000 lost: 1 last_data_recv: 2327697154 cwnd: 1 last_queued_since: 6622972531609620 last_delivered_since: 6622972531609620 send_queue_length: 0 send_queue_bytes: 0
2024-04-19  5:39:22 0 [Note] WSREP: (19a9c908-9f93, 'ssl://[::]:4567') connection to peer 00000000-0000 with addr ssl://[beef:beef:beef:beef::1]:4567 timed out, no messages seen in PT3S, socket stats: rtt: 0 rttvar: 250000 rto: 2000000 lost: 1 last_data_recv: 2327700154 cwnd: 1 last_queued_since: 6622975531888988 last_delivered_since: 6622975531888988 send_queue_length: 0 send_queue_bytes: 0

I observed this using:

I think it would be a nice improvement to try both the IPv4 and IPv6 addresses when both are available. It would help nodes reconnect in a partial outage scenario like the one I observed.
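
To illustrate the idea, here is a minimal Python sketch of sequential address fallback. It is not Galera code and says nothing about how Galera's connection logic is actually structured; the function name `connect_first_reachable` and the candidate list are hypothetical. It simply tries each known address of a peer in order and keeps the first one that connects.

```python
import socket


def connect_first_reachable(candidates, timeout=3.0):
    """Try each (host, port) candidate in order and return the first
    connected socket. `candidates` is a hypothetical list holding every
    known address of one peer, e.g. its IPv6 address followed by its
    IPv4 address."""
    errors = []
    for host, port in candidates:
        try:
            # create_connection raises OSError (including timeouts)
            # when the address cannot be reached.
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as exc:
            # Remember the failure and move on to the next address.
            errors.append((host, port, exc))
    raise OSError(f"all candidates failed: {errors}")
```

With a peer listed under both address families, an unreachable IPv6 address costs one timeout and the IPv4 address is tried next, instead of the same IPv6 address being retried indefinitely. A more refined variant would race the two families in parallel, in the style of Happy Eyeballs.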

It would also be interesting if Galera could maintain multiple open connections between two nodes, but that may be annoying from a code perspective.

Perhaps the simplest solution is to just use Multipath TCP, which could still present Galera with an identical input/output experience while handling connection redundancy at the protocol level.
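
As a rough sketch of what opting in to Multipath TCP looks like at the socket level, the Python snippet below requests an MPTCP socket and falls back to plain TCP when the platform does not support it. This is an illustration only, not Galera code: it assumes Linux 5.6 or later with MPTCP enabled and Python 3.10 or later (where `socket.IPPROTO_MPTCP` is defined), and the helper name `open_stream_socket` is hypothetical.

```python
import socket


def open_stream_socket(family=socket.AF_INET):
    """Return (sock, used_mptcp). Requests a Multipath TCP socket where
    the platform supports it, otherwise falls back to regular TCP."""
    # IPPROTO_MPTCP only exists in the socket module on supporting builds.
    proto = getattr(socket, "IPPROTO_MPTCP", None)
    if proto is not None:
        try:
            return socket.socket(family, socket.SOCK_STREAM, proto), True
        except OSError:
            # The kernel lacks MPTCP support or has it disabled.
            pass
    return socket.socket(family, socket.SOCK_STREAM), False
```

The returned socket is used with the same `connect`, `send`, and `recv` calls as a regular TCP socket, which is why the application above it would not need to change.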