codership / galera

Synchronous multi-master replication library
GNU General Public License v2.0

Node restarting causes cluster to crash #639

Open benshalev849 opened 1 year ago

benshalev849 commented 1 year ago

NOTE: We are trying to fix this issue and to understand it better, including whether we are doing something wrong. Thanks for the help :) This seems similar, if not identical, to the following issue (but with a bigger cluster): https://github.com/codership/galera/issues/623 and to this issue: https://github.com/codership/galera/issues/410

Recently we have had a couple of problems with our Galera cluster. We added a third region with 3 more nodes (we used to have 3 nodes in 2 regions, plus 1 garbd in one of those regions).

A few days ago the compute host that one node's VM was running on crashed. When the node came back up, it crashed the cluster with SST problems: the cluster went down, became read-only, and needed to be bootstrapped.

We are using: Galera 26.4.4, MariaDB 10.4.13

The configuration is as follows and is the same on all nodes (apart from the ist.recv_bind IP and wsrep_node_address):

my.cnf:

[galera]
wsrep_on=ON
wsrep_cluster_name="powerdns"
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
innodb_doublewrite=1
query_cache_size=0
wsrep_provider=/usr/lib64/galera-4/libgalera_smm.so
wsrep_cluster_address=gcomm://<9 ips of nodes>
wsrep_notify_cmd=/usr/bin/get-status.sh

wsrep_provider_options="gmcast.segment=<segment>; ist.recv_bind=<ip>; socket.ssl_cert=/etc/ssl/mysql/server-cert.pem;socket.ssl_key=/etc/ssl/mysql/server-key.pem;socket.ssl_ca=/etc/ssl/mysql/ca-cert.pem"
wsrep_dirty_reads=ON
wsrep-sync-wait=0
wsrep_node_address="<node_ip>"

[mysqld]
ssl-ca = /etc/ssl/mysql/ca-cert.pem
ssl-key = /etc/ssl/mysql/server-key.pem
ssl-cert = /etc/ssl/mysql/server-cert.pem

[client]
ssl-ca = /etc/ssl/mysql/ca-cert.pem
ssl-key = /etc/ssl/mysql/client-key.pem
ssl-cert = /etc/ssl/mysql/client-cert.pem
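
Since wsrep_provider_options is one semicolon-separated string of key=value pairs, it can be easier to inspect once split apart, for example after pasting the full value a running node reports for `SHOW VARIABLES LIKE 'wsrep_provider_options'`. The following is only a minimal Python sketch for that; the function name `parse_provider_options` and the sample values (segment 1, address 10.0.0.1) are made up for illustration and are not taken from our nodes.

```python
def parse_provider_options(value):
    """Split a wsrep_provider_options string into a {key: value} dict."""
    options = {}
    for item in value.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate trailing or doubled semicolons
        key, _, val = item.partition("=")
        options[key.strip()] = val.strip()
    return options


if __name__ == "__main__":
    # Hypothetical sample values standing in for <segment> and <ip>.
    example = (
        "gmcast.segment=1; ist.recv_bind=10.0.0.1; "
        "socket.ssl_cert=/etc/ssl/mysql/server-cert.pem"
    )
    opts = parse_provider_options(example)
    # Show only the IST-related settings.
    for key in sorted(k for k in opts if k.startswith("ist.")):
        print(key, "=", opts[key])
```

Comparing the `ist.*` entries between a node that joins cleanly and one that does not may help narrow down address or binding differences.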

The logs we see on the nodes that cause the crash (JOINER nodes):

WSREP: Member 7.1 (db-<region-1>-1) request state transfer from '*any*'. Selected 6.1 (db-<region-1>-2)(SYNCED) as donor.
WSREP: Shifting PRIMARY -> JOINER (TO: 59319)
WSREP: Requesting state transfer: success, donor: 6
WSREP: forgetting f46bc950-abe6 (ssl://<ip>:4567)
version= 6,
component = PRIMARY,
conf_id = 75
members = 6/7 (joined/total),
act_id = 59324
last_appl. = 59214
protocols = 2/10/4 (gcs/repl/appl),
[Warning] WSREP: Donor f46bc950-9d7f-11ed-abe6-57fe7b2de322 is no longer in the group. State transfer cannot be completed, need to abort. Aborting
WSREP: /usr/bin/mysql: Terminated
systemd: mariadb.service: main process exited, code=killed, status=6/ABRT
mysqld: Terminated
WSREP_SST: [INFO] Joined cleanup. rsync PID:4389
rsyncd[4389]: sent 0 bytes received 0 bytes total size 0
mysql: WSREP_SST:[INFO] Joined cleanup done.
Failed to start MariaDB 10.4.13
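
To compare these events across several nodes, the state shifts and the "donor is no longer in the group" warnings can be pulled out of each log mechanically. This is a rough Python sketch that assumes the lines look like the excerpts quoted in this issue; the function name `summarize` and the regular expressions are illustrative only and may need adjusting for other log formats.

```python
import re
import sys

# Patterns modelled on the log lines quoted in this issue.
SHIFT_RE = re.compile(r"Shifting (\S+) -> (\S+) \(TO: (\d+)\)")
DONOR_GONE_RE = re.compile(r"Donor (\S+) is no longer in the group")


def summarize(lines):
    """Return (state_shifts, lost_donors) found in wsrep log lines."""
    shifts, lost_donors = [], []
    for line in lines:
        match = SHIFT_RE.search(line)
        if match:
            old, new, seqno = match.groups()
            shifts.append((old, new, int(seqno)))
            continue
        match = DONOR_GONE_RE.search(line)
        if match:
            lost_donors.append(match.group(1))
    return shifts, lost_donors


if __name__ == "__main__":
    # Usage: python summarize_wsrep.py /path/to/mariadb.log
    with open(sys.argv[1], errors="replace") as handle:
        shifts, lost = summarize(handle)
    for old, new, seqno in shifts:
        print(f"shift {old} -> {new} at seqno {seqno}")
    for donor in lost:
        print(f"donor left the group: {donor}")
```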

The logs we see on the donor:

WSREP: Member 7.1 (db-<region-1>-1) request state transfer from '*any*'. Selected 6.1 (db-<region-1>-2)(SYNCED) as donor.
Shifting SYNCED -> DONOR/DESYNCED (TO: 59319)
WSREP: Detected STR version: 1, req_len: 120, req: STRv1
Cert index preload: 59215 -> 59319
IST sender using ssl
[ERROR] WSREP: Failed to process action STATE_REQUEST, g:59319, l:5187, ptr:0x7f6322974e78, size: 120: IST sender, failed to connect 'ssl://<server_ip>:4568': connect: No route to host: 113 (No route to host)
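
The error above is a plain TCP connect failure from the donor to the joiner's IST port (4568), so one thing that can be checked independently of Galera is whether that port is reachable from each prospective donor. Below is a minimal Python sketch of such a probe. The port numbers are the usual Galera defaults (4567 for group communication, 4568 for IST, 4444 for rsync SST) and are an assumption about this deployment; the helper names `can_connect` and `check_node` are made up for illustration.

```python
import socket
import sys

# Assumed default Galera ports; adjust if the deployment overrides them.
GALERA_PORTS = {
    "group communication": 4567,
    "IST": 4568,
    "SST (rsync)": 4444,
}


def can_connect(host, port, timeout=3.0):
    """Return (True, "") if a TCP connection opens, else (False, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:
        return False, str(exc)


def check_node(host, ports=GALERA_PORTS, timeout=3.0):
    """Probe each port on host and return {label: (ok, reason)}."""
    return {
        label: can_connect(host, port, timeout)
        for label, port in ports.items()
    }


if __name__ == "__main__":
    # Usage: python check_ports.py <joiner_ip> [<other_ip> ...]
    for host in sys.argv[1:]:
        for label, (ok, reason) in check_node(host).items():
            status = "open" if ok else f"FAILED ({reason})"
            print(f"{host} {label}: {status}")
```

Run from a donor against the joiner's address, a "Connection refused" on 4568 or 4444 may only mean that no state transfer is in progress at that moment, whereas "No route to host" (errno 113) would more likely point at routing or firewall rules between the regions.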

After that, the joining node went through each node in the "line" of donors, crashing them one after another, until it reached one that it did not crash (the one we bootstrapped from).

The second time (after it restarts) we see normal logs up until the following line: [Warning] WSREP: Donor <id> is no longer in the group. State transfer cannot be completed, need to abort. Aborting... This seems to happen because the connecting node caused the donor to crash. We then see the same log on all of the other nodes that it crashes.

This has already happened to us twice and causes a lot of problems and downtime. What is the cause of this? Why does it only happen sometimes?

Why does the node sometimes succeed in syncing, while at other times it goes through the nodes one by one and causes them to crash? Thanks :)