codership / galera

Synchronous multi-master replication library
GNU General Public License v2.0

WSREP: exception from gcomm, backend must be restarted failed to form singleton view after exceeding max_install_timeouts 3, giving up (FATAL) #638

Open suizouwuya opened 1 year ago

suizouwuya commented 1 year ago

[Version] Galera 25.3.33, MariaDB 10.3.30

[Background of the problem]

  1. A three-node MariaDB cluster: node1: 1.1.1.21, node2: 1.1.1.22, node3: 1.1.1.23 (a configuration check sketch follows this item)
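
For context, this wiring can be confirmed from any node with the standard wsrep variables; the gcomm:// address shown in the comment below is an assumption based on the tcp://...:9601 endpoints that appear in the logs, since the actual server configuration is not included in this report.

```sql
-- Sketch: confirm the three-node wiring described above (actual settings were not posted).
-- The gcomm:// address is an assumption based on the tcp://...:9601 endpoints in the logs.
SHOW GLOBAL VARIABLES LIKE 'wsrep_cluster_address';
-- expected along the lines of: gcomm://1.1.1.21:9601,1.1.1.22:9601,1.1.1.23:9601
SHOW GLOBAL VARIABLES LIKE 'wsrep_node_address';
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';   -- 3 when all nodes are joined
```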

[Problem scenario]

  1. Problem time: 2023-02-28 15:50:15
  2. Problem node: node2
  3. Last restart and sync time on node2: 2023-02-27 20:07:44 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 10872869). The service was not restarted after 2023-02-27.
  4. Last state shift logged on each node:
     node1: 2023-02-28 14:08:26 0 [Note] WSREP: Restored state OPEN -> SYNCED (11343237)
     node2: 2023-02-28 14:08:26 0 [Note] WSREP: Restored state OPEN -> SYNCED (11343237)
     node3: 2023-02-28 14:08:26 0 [Note] WSREP: Restored state OPEN -> SYNCED (11343237)

[My analysis]

  1. The problem may be related to an unstable network environment (see the tuning sketch after this list).
  2. I have tried many times but have been unable to reproduce it.
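
The fatal message in the log below names max_install_timeouts, which corresponds to the Galera evs.* provider options (evs.suspect_timeout, evs.inactive_timeout, evs.install_timeout, evs.max_install_timeouts). A sketch of how these could be inspected and, on an unreliable network, loosened is shown here; the values are illustrative assumptions rather than documented recommendations, and some evs.* options may only take effect when set in the configuration file at server startup.

```sql
-- Inspect the provider options currently in effect (the evs.* group is part of this string).
SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options';

-- Hypothetical loosening for a flaky network: give the EVS layer more time to agree on a
-- new view before the install timer expires. Values are illustrative, not recommendations.
SET GLOBAL wsrep_provider_options =
    'evs.suspect_timeout=PT10S; evs.inactive_timeout=PT30S; evs.install_timeout=PT15S';
```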

[Problem]

  1. node1 log: node1.txt

  2. node2 log: node2.txt

  3. node3 log: node3.txt

  4. The node2 error log shows:

2023-02-28 15:50:08 0 [Note] WSREP: max install timeouts reached, will isolate node for PT20S
2023-02-28 15:50:08 0 [Note] WSREP: no install message received
2023-02-28 15:50:14 0 [Note] WSREP: (511505cf, 'tcp://1.1.1.22:9601') turning message relay requesting off
2023-02-28 15:50:15 0 [Warning] WSREP: evs::proto(511505cf, GATHER, view_id(REG,511505cf,28)) install timer expired
evs::proto(evs::proto(511505cf, GATHER, view_id(REG,511505cf,28)), GATHER) {
current_view=view(view_id(REG,511505cf,28) memb {
511505cf,0
8e061fa0,0
e71b5018,0
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=62468,safe_seq=62468,node_index=node: {idx=0,range=[62474,62473],safe_seq=62468} node: {idx=1,range=[62475,62474],safe_seq=62469} node: {idx=2,range=[62469,62468],safe_seq=62468} },
fifo_seq=1384760,
last_sent=62473,
known:
511505cf at 
{o=1,s=0,i=0,fs=-1,jm=
{v=0,t=4,ut=255,o=1,s=62468,sr=-1,as=62468,f=0,src=511505cf,srcvid=view_id(REG,511505cf,28),insvid=view_id(UNKNOWN,00000000,0),ru=00000000,r=[-1,-1],fs=1384760,nl=(
511505cf, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,511505cf,28),ss=62468,ir=[62474,62473],}
8e061fa0, {o=0,s=1,e=0,ls=-1,vid=view_id(REG,511505cf,28),ss=62469,ir=[62475,62474],}
e71b5018, {o=0,s=1,e=0,ls=-1,vid=view_id(REG,511505cf,28),ss=62468,ir=[62469,62468],}
)
},
}
8e061fa0 at tcp://1.1.1.21:9601
{o=0,s=1,i=0,fs=6212585,}
e71b5018 at tcp://1.1.1.23:9601
{o=0,s=1,i=0,fs=5108065,}
}
2023-02-28 15:50:15 0 [Note] WSREP: going to give up, state dump for diagnosis:
evs::proto(evs::proto(511505cf, GATHER, view_id(REG,511505cf,28)), GATHER) {
current_view=view(view_id(REG,511505cf,28) memb {
511505cf,0
8e061fa0,0
e71b5018,0
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=62468,safe_seq=62468,node_index=node: {idx=0,range=[62474,62473],safe_seq=62468} node: {idx=1,range=[62475,62474],safe_seq=62469} node: {idx=2,range=[62469,62468],safe_seq=62468} },
fifo_seq=1384760,
last_sent=62473,
known:
511505cf at 
{o=1,s=0,i=0,fs=-1,jm=
{v=0,t=4,ut=255,o=1,s=62468,sr=-1,as=62468,f=0,src=511505cf,srcvid=view_id(REG,511505cf,28),insvid=view_id(UNKNOWN,00000000,0),ru=00000000,r=[-1,-1],fs=1384760,nl=(
511505cf, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,511505cf,28),ss=62468,ir=[62474,62473],}
8e061fa0, {o=0,s=1,e=0,ls=-1,vid=view_id(REG,511505cf,28),ss=62469,ir=[62475,62474],}
e71b5018, {o=0,s=1,e=0,ls=-1,vid=view_id(REG,511505cf,28),ss=62468,ir=[62469,62468],}
)
},
}
8e061fa0 at tcp://1.1.1.21:9601
{o=0,s=1,i=0,fs=6212585,}
e71b5018 at tcp://1.1.1.23:9601
{o=0,s=1,i=0,fs=5108065,}
}
2023-02-28 15:50:15 0 [ERROR] WSREP: exception from gcomm, backend must be restarted: evs::proto(511505cf, GATHER, view_id(REG,511505cf,28)) failed to form singleton view after exceeding max_install_timeouts 3, giving up (FATAL)
at gcomm/src/evs_proto.cpp:handle_install_timer():727
2023-02-28 15:50:15 0 [Note] WSREP: gcomm: terminating thread
2023-02-28 15:50:15 0 [Note] WSREP: gcomm: joining thread
2023-02-28 15:50:15 0 [Note] WSREP: gcomm: closing backend
2023-02-28 15:50:15 0 [Note] WSREP: Forced PC close
2023-02-28 15:50:15 0 [Warning] WSREP: discarding 3 messages from message index
2023-02-28 15:50:15 0 [Note] WSREP: gcomm: closed
2023-02-28 15:50:15 0 [Note] WSREP: Received self-leave message.
2023-02-28 15:50:15 0 [Note] WSREP: comp msg error in core 103
2023-02-28 15:50:15 0 [Note] WSREP: Closing send monitor...
2023-02-28 15:50:15 0 [Note] WSREP: Closed send monitor.
2023-02-28 15:50:15 0 [Note] WSREP: Closing replication queue.
2023-02-28 15:50:15 0 [Note] WSREP: Closing slave action queue.
2023-02-28 15:50:15 2 [Note] WSREP: New cluster view: global state: 00000000-0000-0000-0000-000000000000:0, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version -1
2023-02-28 15:50:15 0 [Note] WSREP: Shifting SYNCED -> CLOSED (TO: 11391296)
2023-02-28 15:50:15 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2023-02-28 15:50:15 0 [Note] WSREP: RECV thread exiting -103: Software caused connection abort
2023-02-28 15:50:15 2 [Note] WSREP: applier thread exiting (code:6)
  1. I double-checked the issue list and found two issues that look somewhat similar to this one, but the version I am using should already include those fixes: https://github.com/codership/galera/issues/202 and https://github.com/codership/galera/issues/40
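
For reference, after this kind of fatal gcomm exception the node drops to non-Primary and then CLOSED (as in the log above) and the mariadb service has to be restarted before it can rejoin. A quick health check on the other nodes, and on node2 after a restart, could look like the sketch below; the expected values assume a healthy three-node cluster.

```sql
-- Sketch: verify cluster health with standard wsrep status variables.
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';       -- expect 'Primary'
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';         -- expect 3 once node2 has rejoined
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';  -- expect 'Synced'
SHOW GLOBAL STATUS LIKE 'wsrep_ready';                -- expect 'ON'
```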