codership / galera

Synchronous multi-master replication library
GNU General Public License v2.0

About pc.ignore_sb = true: I added this parameter to keep the cluster working in case of split-brain #643

Open WarnAndFine opened 1 year ago

WarnAndFine commented 1 year ago

I saw the documentation for this parameter. I have three machines that form a cluster, and all three are configured with it. I turned off the network interface of one of the machines and found that its status changed from Primary to Non-Primary; it can no longer read or write and reports the following error:

ERROR 1047 (08S01): WSREP has not yet prepared node for application use

Querying the options with SHOW VARIABLES LIKE 'wsrep_provider_options'; shows that pc.ignore_sb = true is set.
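For reference, this is how the option string and the node's acceptance state can be checked from the client (a minimal sketch using the standard wsrep variable names):

-- confirm which provider options are actually in effect on this node
SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options';
-- ERROR 1047 is returned while this is OFF
SHOW GLOBAL STATUS LIKE 'wsrep_ready';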

Environment:

mariadb: mariadb-10.11.2-linux-systemd-x86_64
galera: galera-4-26.4.14-1.el7.centos.x86_64
system: Linux 3.10.0-1127.el7.x86_64 (CentOS 7)

Log after shutting down the network interface:

2023-07-13 15:34:01 0 [Note] WSREP: (c918c367-908e, 'tcp://0.0.0.0:4567') connection to peer b28c6e35-ab02 with addr tcp://192.168.1.179:4567 timed out, no messages seen in PT3S, socket stats: rtt: 8982 rttvar: 15039 rto: 209000 lost: 0 last_data_recv: 3323 cwnd: 10 last_queued_since: 322666886 last_delivered_since: 3322090455 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
2023-07-13 15:34:01 0 [Note] WSREP: (c918c367-908e, 'tcp://0.0.0.0:4567') connection to peer 4c100681-8496 with addr tcp://192.168.1.74:4567 timed out, no messages seen in PT3S, socket stats: rtt: 9888 rttvar: 16262 rto: 210000 lost: 0 last_data_recv: 3323 cwnd: 10 last_queued_since: 6209 last_delivered_since: 3322191023 send_queue_length: 1 send_queue_bytes: 212 segment: 0 messages: 1
2023-07-13 15:34:01 0 [Note] WSREP: Deferred close timer started for socket with remote endpoint: tcp://192.168.1.74:4567
2023-07-13 15:34:01 0 [Note] WSREP: (c918c367-908e, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.1.179:4567 tcp://192.168.1.74:4567
2023-07-13 15:34:01 0 [Note] WSREP: Deferred close timer handle_wait Operation aborted. for 0x7fa5200019e8
2023-07-13 15:34:01 0 [Note] WSREP: Deferred close timer destruct
2023-07-13 15:34:03 0 [Note] WSREP: (c918c367-908e, 'tcp://0.0.0.0:4567') reconnecting to b28c6e35-ab02 (tcp://192.168.1.179:4567), attempt 0
2023-07-13 15:34:03 0 [Note] WSREP: (c918c367-908e, 'tcp://0.0.0.0:4567') reconnecting to 4c100681-8496 (tcp://192.168.1.74:4567), attempt 0
2023-07-13 15:34:03 0 [Note] WSREP: evs::proto(c918c367-908e, OPERATIONAL, view_id(REG,4c100681-8496,20)) suspecting node: 4c100681-8496
2023-07-13 15:34:03 0 [Note] WSREP: evs::proto(c918c367-908e, OPERATIONAL, view_id(REG,4c100681-8496,20)) suspected node without join message, declaring inactive
2023-07-13 15:34:03 0 [Note] WSREP: evs::proto(c918c367-908e, OPERATIONAL, view_id(REG,4c100681-8496,20)) suspecting node: b28c6e35-ab02
2023-07-13 15:34:03 0 [Note] WSREP: evs::proto(c918c367-908e, OPERATIONAL, view_id(REG,4c100681-8496,20)) suspected node without join message, declaring inactive
2023-07-13 15:34:04 0 [Note] WSREP: view(view_id(NON_PRIM,4c100681-8496,20) memb { c918c367-908e,0 } joined { } left { } partitioned { 4c100681-8496,0 b28c6e35-ab02,0 })
2023-07-13 15:34:04 0 [Note] WSREP: view(view_id(NON_PRIM,c918c367-908e,21) memb { c918c367-908e,0 } joined { } left { } partitioned { 4c100681-8496,0 b28c6e35-ab02,0 })
2023-07-13 15:34:04 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2023-07-13 15:34:04 0 [Note] WSREP: Flow-control interval: [16, 16]
2023-07-13 15:34:04 0 [Note] WSREP: Received NON-PRIMARY.
2023-07-13 15:34:04 0 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 1089)
2023-07-13 15:34:04 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2023-07-13 15:34:04 0 [Note] WSREP: Flow-control interval: [16, 16]
2023-07-13 15:34:04 0 [Note] WSREP: Received NON-PRIMARY.
2023-07-13 15:34:04 2 [Note] WSREP: ================================================
  View:
    id: 16797e05-1ed5-11ee-bc52-9b6c8a07bc93:1089
    status: non-primary
    protocol_version: 4
    capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
    final: no
    own_index: 0
    members(1):
      0: c918c367-2149-11ee-908e-b7c35a520d3d, test2

2023-07-13 15:34:04 2 [Note] WSREP: Non-primary view
2023-07-13 15:34:04 2 [Note] WSREP: Server status change synced -> connected
2023-07-13 15:34:04 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2023-07-13 15:34:04 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2023-07-13 15:34:04 2 [Note] WSREP: ================================================
  View:
    id: 16797e05-1ed5-11ee-bc52-9b6c8a07bc93:1089
    status: non-primary
    protocol_version: 4
    capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
    final: no
    own_index: 0
    members(1):
      0: c918c367-2149-11ee-908e-b7c35a520d3d, test2

2023-07-13 15:34:04 2 [Note] WSREP: Non-primary view
2023-07-13 15:34:04 2 [Note] WSREP: Server status change connected -> connected
2023-07-13 15:34:04 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2023-07-13 15:34:04 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2023-07-13 15:34:24 12 [ERROR] Slave I/O: error connecting to master 'repl@192.168.1.179:3306' - retry-time: 60 maximum-retries: 100000 message: Can't connect to server on '192.168.1.179' (101 "Network is unreachable"), Internal MariaDB error code: 2003
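At this point the isolated node's view of the cluster can be inspected with the standard wsrep status variables (a sketch; the values depend on the node's state):

-- membership and component status as seen by the isolated node
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';    -- drops to 1 after the partition
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';  -- Primary vs. non-Primary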

My configuration file is as follows:

server.cnf

[mysqld]
datadir=/home/mariadb_data
socket=/usr/local/mariadb/socket/mysql.sock
bind-address=0.0.0.0
user=mysql
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
binlog_format=ROW
log-error=/usr/local/mariadb/log/mysqld.log

[galera]
wsrep_on=ON
wsrep_provider=/usr/lib64/galera-4/libgalera_smm.so
wsrep_node_name='test2'
wsrep_node_address="192.168.1.193"
wsrep_cluster_name='galera-cluster'
wsrep_cluster_address="gcomm://192.168.1.179,191.168.1.193,192.168.1.74"
wsrep_provider_options="gcache.size=1G"
wsrep_slave_threads=4
wsrep_sst_method=rsync
wsrep_provider_options="pc.ignore_sb=TRUE"

pc.ignore_sb=true

wsrep_provider_options="pc.ignore_quorum=true"
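Since wsrep_provider_options appears twice in the [galera] section above and, if I understand MariaDB option-file handling correctly, only the last occurrence takes effect, a single combined entry might look like this sketch (pc.ignore_quorum is included only because it is mentioned above, not because it is required):

[galera]
# semicolon-separated provider options in one wsrep_provider_options entry (sketch)
wsrep_provider_options="gcache.size=1G;pc.ignore_sb=true;pc.ignore_quorum=true"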

My question: with pc.ignore_sb configured, I expected the other nodes to still be able to insert and modify data during a split-brain, but in practice that is not possible. If convenient, please give some suggestions or references.
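For completeness, the only runtime workaround I have found in the Galera documentation so far is to manually force the surviving part back into a primary component, which obviously risks data divergence if the other partition also keeps accepting writes (sketch):

-- run on the node/partition that should continue accepting writes
SET GLOBAL wsrep_provider_options = 'pc.bootstrap=YES';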