codership / galera

Synchronous multi-master replication library
GNU General Public License v2.0

garbd cluster backup destroys whole cluster #587

Open mpw-wwu opened 3 years ago

mpw-wwu commented 3 years ago

While trying to implement a cluster backup with garbd, I ran into a reproducible breakdown of the whole cluster.

$ cat /etc/garbd.cnf 
group=cluster1
address=gcomm://10.14.28.44:4567,10.14.29.34:4567,10.14.28.184:4567
sst=backup_mysqldump
$ sudo garbd --cfg /etc/garbd.cnf
2021-01-08 11:13:18.634  INFO: CRC-32C: using hardware acceleration.
2021-01-08 11:13:18.634  INFO: Read config: 
    daemon:  0
    name:    garb
    address: gcomm://10.14.28.44:4567,10.14.29.34:4567,10.14.28.184:4567
    group:   cluster1
    sst:     backup_mysqldump
    donor:   
    options: gcs.fc_limit=9999999; gcs.fc_factor=1.0; gcs.fc_master_slave=yes
    cfg:     /etc/garbd.cnf
    log:     

2021-01-08 11:13:18.635  INFO: protonet asio version 0
2021-01-08 11:13:18.635  INFO: Using CRC-32C for message checksums.
2021-01-08 11:13:18.635  INFO: backend: asio
2021-01-08 11:13:18.635  INFO: gcomm thread scheduling priority set to other:0 
2021-01-08 11:13:18.635  WARN: access file(./gvwstate.dat) failed(No such file or directory)
2021-01-08 11:13:18.636  INFO: restore pc from disk failed
2021-01-08 11:13:18.636  INFO: GMCast version 0
2021-01-08 11:13:18.636  INFO: (82d3296d, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2021-01-08 11:13:18.636  INFO: (82d3296d, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2021-01-08 11:13:18.636  INFO: EVS version 0
2021-01-08 11:13:18.637  INFO: gcomm: connecting to group 'cluster1', peer '10.14.28.44:4567,10.14.29.34:4567,10.14.28.184:4567'
2021-01-08 11:13:18.642  INFO: (82d3296d, 'tcp://0.0.0.0:4567') connection established to 0c71666e tcp://10.14.28.44:4567
2021-01-08 11:13:18.642  INFO: (82d3296d, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: 
2021-01-08 11:13:18.649  INFO: (82d3296d, 'tcp://0.0.0.0:4567') connection established to 10d16c7b tcp://10.14.29.34:4567
2021-01-08 11:13:18.650  INFO: (82d3296d, 'tcp://0.0.0.0:4567') connection established to 06a0398c tcp://10.14.28.184:4567
2021-01-08 11:13:18.908  INFO: declaring 06a0398c at tcp://10.14.28.184:4567 stable
2021-01-08 11:13:18.908  INFO: declaring 0c71666e at tcp://10.14.28.44:4567 stable
2021-01-08 11:13:18.908  INFO: declaring 10d16c7b at tcp://10.14.29.34:4567 stable
2021-01-08 11:13:19.912  INFO: Node 06a0398c state prim
2021-01-08 11:13:19.914  INFO: view(view_id(PRIM,06a0398c,4) memb {
    06a0398c,0
    0c71666e,0
    10d16c7b,0
    82d3296d,0
} joined {
} left {
} partitioned {
})
2021-01-08 11:13:19.915  INFO: save pc into disk
2021-01-08 11:13:20.137  INFO: gcomm: connected
2021-01-08 11:13:20.137  INFO: Changing maximum packet size to 64500, resulting msg size: 32636
2021-01-08 11:13:20.137  INFO: Shifting CLOSED -> OPEN (TO: 0)
2021-01-08 11:13:20.137  INFO: Opened channel 'cluster1'
2021-01-08 11:13:20.138  INFO: New COMPONENT: primary = yes, bootstrap = no, my_idx = 3, memb_num = 4
2021-01-08 11:13:20.138  INFO: STATE EXCHANGE: Waiting for state UUID.
2021-01-08 11:13:20.138  INFO: STATE EXCHANGE: sent state msg: 83975b22-51a2-11eb-b634-4bb2724649ce
2021-01-08 11:13:20.138  INFO: STATE EXCHANGE: got state msg: 83975b22-51a2-11eb-b634-4bb2724649ce from 1 (knotena)
2021-01-08 11:13:20.138  INFO: STATE EXCHANGE: got state msg: 83975b22-51a2-11eb-b634-4bb2724649ce from 2 (knotenb)
2021-01-08 11:13:20.138  INFO: STATE EXCHANGE: got state msg: 83975b22-51a2-11eb-b634-4bb2724649ce from 0 (knotenc)
2021-01-08 11:13:20.142  INFO: STATE EXCHANGE: got state msg: 83975b22-51a2-11eb-b634-4bb2724649ce from 3 (garb)
2021-01-08 11:13:20.142  INFO: Quorum results:
    version    = 4,
    component  = PRIMARY,
    conf_id    = 3,
    members    = 3/4 (joined/total),
    act_id     = 4,
    last_appl. = -1,
    protocols  = 0/10/4 (gcs/repl/appl),
    group UUID = 06a2e5c0-51a2-11eb-b317-26644a232db7
2021-01-08 11:13:20.142  INFO: Flow-control interval: [9999999, 9999999]
2021-01-08 11:13:20.142  INFO: Trying to continue unpaused monitor
2021-01-08 11:13:20.142  INFO: Shifting OPEN -> PRIMARY (TO: 4)
2021-01-08 11:13:20.142  INFO: Sending state transfer request: 'backup_mysqldump', size: 16
2021-01-08 11:13:20.148  INFO: Member 3.0 (garb) requested state transfer from '*any*'. Selected 0.0 (knotenc)(SYNCED) as donor.
2021-01-08 11:13:20.148  INFO: Shifting PRIMARY -> JOINER (TO: 4)
2021-01-08 11:13:20.148  INFO: Closing send monitor...
2021-01-08 11:13:20.148  INFO: Closed send monitor.
2021-01-08 11:13:20.148  INFO: gcomm: terminating thread
2021-01-08 11:13:20.148  INFO: gcomm: joining thread
2021-01-08 11:13:20.148  INFO: gcomm: closing backend
2021-01-08 11:13:20.152  INFO: view(view_id(NON_PRIM,06a0398c,4) memb {
    82d3296d,0
} joined {
} left {
} partitioned {
    06a0398c,0
    0c71666e,0
    10d16c7b,0
})
2021-01-08 11:13:20.152  INFO: view((empty))
2021-01-08 11:13:20.152  INFO: gcomm: closed
2021-01-08 11:13:20.152  INFO: 3.0 (garb): State transfer from 0.0 (knotenc) complete.
2021-01-08 11:13:20.152  INFO: Shifting JOINER -> JOINED (TO: 4)
2021-01-08 11:13:20.152  WARN: 0x565193254808 down context(s) not set
2021-01-08 11:13:20.152  WARN: Failed to send SYNC signal: -107 (Transport endpoint is not connected)
2021-01-08 11:13:20.152  INFO: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2021-01-08 11:13:20.152  INFO: Flow-control interval: [9999999, 9999999]
2021-01-08 11:13:20.152  INFO: Trying to continue unpaused monitor
2021-01-08 11:13:20.152  INFO: Received NON-PRIMARY.
2021-01-08 11:13:20.152  INFO: Shifting JOINED -> OPEN (TO: 4)
2021-01-08 11:13:20.152  INFO: Received self-leave message.
2021-01-08 11:13:20.152  INFO: Flow-control interval: [9999999, 9999999]
2021-01-08 11:13:20.152  INFO: Trying to continue unpaused monitor
2021-01-08 11:13:20.152  INFO: Received SELF-LEAVE. Closing connection.
2021-01-08 11:13:20.152  INFO: Shifting OPEN -> CLOSED (TO: 4)
2021-01-08 11:13:20.152  INFO: RECV thread exiting 0: Success
2021-01-08 11:13:20.153  INFO: recv_thread() joined.
2021-01-08 11:13:20.153  INFO: Closing replication queue.
2021-01-08 11:13:20.153  INFO: Closing slave action queue.
2021-01-08 11:13:20.153  WARN: Attempt to close a closed connection
2021-01-08 11:13:20.153  INFO: Exiting main loop
2021-01-08 11:13:20.153  INFO: Shifting CLOSED -> DESTROYED (TO: 4)
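The sequence of state changes is easy to lose in the output above. As a rough aid, the following Python sketch pulls the `Shifting A -> B (TO: n)` lines out of a garbd or node log. The function name `state_transitions` and the regular expression are illustrative assumptions based only on the log format shown here; they are not part of Galera.

```python
import re

# Matches lines such as "... INFO: Shifting PRIMARY -> JOINER (TO: 4)".
# The pattern is derived from the log excerpt only (an assumption).
SHIFT_RE = re.compile(r"Shifting (\S+) -> (\S+) \(TO: (-?\d+)\)")


def state_transitions(log_lines):
    """Return (old_state, new_state, seqno) for every 'Shifting' line."""
    transitions = []
    for line in log_lines:
        match = SHIFT_RE.search(line)
        if match:
            old, new, seqno = match.groups()
            transitions.append((old, new, int(seqno)))
    return transitions


# A few lines copied from the garbd output above.
sample = [
    "2021-01-08 11:13:20.137  INFO: Shifting CLOSED -> OPEN (TO: 0)",
    "2021-01-08 11:13:20.142  INFO: Shifting OPEN -> PRIMARY (TO: 4)",
    "2021-01-08 11:13:20.148  INFO: Shifting PRIMARY -> JOINER (TO: 4)",
    "2021-01-08 11:13:20.148  INFO: Closing send monitor...",
    "2021-01-08 11:13:20.152  INFO: Shifting JOINER -> JOINED (TO: 4)",
]

for old, new, seqno in state_transitions(sample):
    print(f"{old} -> {new} (TO: {seqno})")
```

Run against the full garbd log, this makes it visible that garbd goes from JOINER to JOINED within a few milliseconds and then straight back to OPEN and CLOSED.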

The backup didn't work, but that's not the issue here. The real and, in my opinion, very serious problem is that the whole cluster is stuck in an unsynced state.

Syslog from node A (there are three nodes in total):

Jan  8 12:13:18 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:18 0 [Note] WSREP: (0c71666e-9ddf, 'tcp://0.0.0.0:4567') connection established to 82d3296d-8cb0 tcp://10.14.28.179:4567
Jan  8 12:13:18 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:18 0 [Note] WSREP: (0c71666e-9ddf, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
Jan  8 12:13:18 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:18 0 [Note] WSREP: EVS version downgrade 1 -> 0
Jan  8 12:13:18 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:18 0 [Note] WSREP: declaring 06a0398c-95db at tcp://10.14.28.184:4567 stable
Jan  8 12:13:18 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:18 0 [Note] WSREP: declaring 10d16c7b-9aa6 at tcp://10.14.29.34:4567 stable
Jan  8 12:13:18 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:18 0 [Note] WSREP: declaring 82d3296d-8cb0 at tcp://10.14.28.179:4567 stable
Jan  8 12:13:18 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:18 0 [Note] WSREP: PC protocol downgrade 1 -> 0
Jan  8 12:13:19 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:19 0 [Note] WSREP: Node 06a0398c-95db state prim
Jan  8 12:13:19 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:19 0 [Note] WSREP: view(view_id(PRIM,06a0398c-95db,4) memb {
Jan  8 12:13:19 galera-knoten-a mariadbd[913]: #01106a0398c-95db,0
Jan  8 12:13:19 galera-knoten-a mariadbd[913]: #0110c71666e-9ddf,0
Jan  8 12:13:19 galera-knoten-a mariadbd[913]: #01110d16c7b-9aa6,0
Jan  8 12:13:19 galera-knoten-a mariadbd[913]: #01182d3296d-8cb0,0
Jan  8 12:13:19 galera-knoten-a mariadbd[913]: } joined {
Jan  8 12:13:19 galera-knoten-a mariadbd[913]: } left {
Jan  8 12:13:19 galera-knoten-a mariadbd[913]: } partitioned {
Jan  8 12:13:19 galera-knoten-a mariadbd[913]: })
Jan  8 12:13:19 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:19 0 [Note] WSREP: save pc into disk
Jan  8 12:13:19 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:19 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 4
Jan  8 12:13:19 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:19 0 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
Jan  8 12:13:19 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:19 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 83975b22-51a2-11eb-b634-4bb2724649ce
Jan  8 12:13:19 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:19 0 [Note] WSREP: STATE EXCHANGE: got state msg: 83975b22-51a2-11eb-b634-4bb2724649ce from 1 (knotena)
Jan  8 12:13:19 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:19 0 [Note] WSREP: STATE EXCHANGE: got state msg: 83975b22-51a2-11eb-b634-4bb2724649ce from 2 (knotenb)
Jan  8 12:13:19 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:19 0 [Note] WSREP: STATE EXCHANGE: got state msg: 83975b22-51a2-11eb-b634-4bb2724649ce from 0 (knotenc)
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: STATE EXCHANGE: got state msg: 83975b22-51a2-11eb-b634-4bb2724649ce from 3 (garb)
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: Quorum results:
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: #011version    = 4,
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: #011component  = PRIMARY,
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: #011conf_id    = 3,
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: #011members    = 3/4 (joined/total),
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: #011act_id     = 4,
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: #011last_appl. = 0,
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: #011protocols  = 0/10/4 (gcs/repl/appl),
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: #011vote policy= 1,
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: #011group UUID = 06a2e5c0-51a2-11eb-b317-26644a232db7
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: Flow-control interval: [32, 32]
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 1 [Note] WSREP: ####### processing CC 4, local, ordered
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 1 [Note] WSREP: ####### My UUID: 0c71666e-51a2-11eb-9ddf-8651c5bbc249
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 1 [Note] WSREP: Skipping cert index reset
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 1 [Note] WSREP: REPL Protocols: 10 (5)
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 1 [Note] WSREP: ####### Adjusting cert position: 4 -> 4
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: Service thread queue flushed.
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: Member 3.0 (garb) requested state transfer from '*any*'. Selected 0.0 (knotenc)(SYNCED) as donor.
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: 3.0 (garb): State transfer from 0.0 (knotenc) complete.
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 1 [ERROR] WSREP: Failed to process action CONFIGURATION, g: 1, l: 7, ptr: 0x7fe53c001a50, size: 296: Attempt to reuse the same seqno: 4. New buffer: addr: 0x7fe53c001a38, seqno: 0, size: 320, ctx: 0x55ae227b2d20, flags: 0. store: 1, type: 0, previous buffer: addr: 0x7fe53c001828, seqno: 4, size: 528, ctx: 0x55ae227b2d20, flags: 0. store: 1, type: 0 (FATAL)
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: #011 at gcache/src/GCache_seqno.cpp:seqno_assign():86
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: EVS version upgrade 0 -> 1
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: declaring 06a0398c-95db at tcp://10.14.28.184:4567 stable
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: declaring 10d16c7b-9aa6 at tcp://10.14.29.34:4567 stable
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: forgetting 82d3296d-8cb0 (tcp://10.14.28.179:4567)
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: PC protocol upgrade 0 -> 1
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: Node 06a0398c-95db state prim
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: declaring 10d16c7b-9aa6 at tcp://10.14.29.34:4567 stable
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: forgetting 06a0398c-95db (tcp://10.14.28.184:4567)
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: Node 0c71666e-9ddf state prim
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Warning] WSREP: 0c71666e-9ddf sending install message failed: Resource temporarily unavailable
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: view(view_id(NON_PRIM,0c71666e-9ddf,6) memb {
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: #0110c71666e-9ddf,0
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: } joined {
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: } left {
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: } partitioned {
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: #01106a0398c-95db,0
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: #01110d16c7b-9aa6,0
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: #01182d3296d-8cb0,0
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: })
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: Flow-control interval: [16, 16]
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: Received NON-PRIMARY.
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 4)
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: forgetting 10d16c7b-9aa6 (tcp://10.14.29.34:4567)
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: view(view_id(NON_PRIM,0c71666e-9ddf,7) memb {
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: #0110c71666e-9ddf,0
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: } joined {
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: } left {
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: } partitioned {
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: #01106a0398c-95db,0
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: #01110d16c7b-9aa6,0
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: #01182d3296d-8cb0,0
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: })
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: Flow-control interval: [16, 16]
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: Received NON-PRIMARY.
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 1 [Note] WSREP: Closing send monitor...
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 1 [Note] WSREP: Closed send monitor.
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 1 [Note] WSREP: gcomm: terminating thread
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 1 [Note] WSREP: gcomm: joining thread
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 1 [Note] WSREP: gcomm: closing backend
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 1 [Note] WSREP: PC protocol downgrade 1 -> 0
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 1 [Note] WSREP: view((empty))
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 1 [Note] WSREP: gcomm: closed
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: New SELF-LEAVE.
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: Flow-control interval: [0, 0]
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 4)
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: RECV thread exiting 0: Success
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 1 [Note] WSREP: recv_thread() joined.
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 1 [Note] WSREP: Closing replication queue.
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 1 [Note] WSREP: Closing slave action queue.
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 1 [Note] WSREP: ================================================
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: View:
Jan  8 12:13:20 galera-knoten-a mariadbd[913]:   id: 00000000-0000-0000-0000-000000000000:-1
Jan  8 12:13:20 galera-knoten-a mariadbd[913]:   status: non-primary
Jan  8 12:13:20 galera-knoten-a mariadbd[913]:   protocol_version: -1
Jan  8 12:13:20 galera-knoten-a mariadbd[913]:   capabilities:
Jan  8 12:13:20 galera-knoten-a mariadbd[913]:   final: yes
Jan  8 12:13:20 galera-knoten-a mariadbd[913]:   own_index: -1
Jan  8 12:13:20 galera-knoten-a mariadbd[913]:   members(0):
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: =================================================
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 1 [Note] WSREP: Non-primary view
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 1 [Note] WSREP: Server status change synced -> disconnected
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 1 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 1 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: Service thread queue flushed.
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 1 [Note] WSREP: ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:-1, protocol version: 5
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 1 [Note] WSREP: Applier thread exiting ret: 8 thd: 1
Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 1 [Warning] Aborted connection 1 to db: 'unconnected' user: 'unauthenticated' host: '' (This connection closed normally without authentication)
Jan  8 12:13:24 galera-knoten-a ntpd[651]: Soliciting pool server 78.46.53.8
Jan  8 12:13:27 galera-knoten-a ntpd[651]: Soliciting pool server 138.201.90.189
Jan  8 12:13:31 galera-knoten-a ntpd[651]: Soliciting pool server 213.239.239.166
Jan  8 12:13:40 galera-knoten-a ntpd[651]: Soliciting pool server 2001:638:502:c015::232
Jan  8 12:13:41 galera-knoten-a ntpd[651]: Soliciting pool server 2a02:c207:2010:9464::1
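The line that matters in the syslog above is the `[ERROR]` entry, which is easy to miss among the notes. The following Python sketch, assuming only the message format visible in this log, finds such entries and extracts the action type and the reused seqno. The name `find_seqno_reuse` is hypothetical.

```python
import re

# Pattern taken from the [ERROR] line in the syslog above (an assumption;
# other Galera versions may word the message differently).
FATAL_RE = re.compile(
    r"\[ERROR\] WSREP: Failed to process action (\w+),.*?"
    r"Attempt to reuse the same seqno: (-?\d+)\."
)


def find_seqno_reuse(log_lines):
    """Return (action, seqno) for each 'Attempt to reuse the same seqno' error."""
    hits = []
    for line in log_lines:
        match = FATAL_RE.search(line)
        if match:
            hits.append((match.group(1), int(match.group(2))))
    return hits


# Two lines copied (and shortened after the seqno) from the syslog above.
sample = [
    "Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 0 [Note] WSREP: "
    "3.0 (garb): State transfer from 0.0 (knotenc) complete.",
    "Jan  8 12:13:20 galera-knoten-a mariadbd[913]: 2021-01-08 12:13:20 1 [ERROR] WSREP: "
    "Failed to process action CONFIGURATION, g: 1, l: 7, ptr: 0x7fe53c001a50, size: 296: "
    "Attempt to reuse the same seqno: 4. New buffer: addr: 0x7fe53c001a38, seqno: 0",
]

for action, seqno in find_seqno_reuse(sample):
    print(f"fatal: action={action} reused seqno={seqno}")
```

In this log the error follows directly after the "State transfer from 0.0 (knotenc) complete" note for garb.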

The cluster remains dysfunctional: all machines have to be restarted and the primary node has to be chosen manually.
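How the primary node was chosen is not stated here. A common approach with Galera is to compare the `seqno` recorded in each node's `grastate.dat` and bootstrap from the most advanced node. The Python sketch below assumes that file's usual `key: value` layout. The node names are taken from the logs above, while the `seqno` values, the sample file content, and the helper names are made up for illustration. A `seqno` of -1 means the position was not saved, so it would have to be recovered another way first.

```python
def parse_grastate(text):
    """Parse the 'key: value' lines of a grastate.dat file into a dict."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def most_advanced(states):
    """Return the node name with the highest non-negative seqno, else None."""
    best_name, best_seqno = None, -1
    for name, state in states.items():
        seqno = int(state.get("seqno", "-1"))
        if seqno > best_seqno:
            best_name, best_seqno = name, seqno
    return best_name


# Hypothetical file content; only the group UUID is taken from the logs.
SAMPLE = """\
# GALERA saved state
version: 2.1
uuid:    06a2e5c0-51a2-11eb-b317-26644a232db7
seqno:   4
safe_to_bootstrap: 0
"""

# Hypothetical per-node states (seqno values invented for the example).
states = {
    "knotena": parse_grastate(SAMPLE),
    "knotenb": {"seqno": "3"},
    "knotenc": {"seqno": "-1"},
}
print(most_advanced(states))
```

If every node reports -1, the function returns `None` rather than guessing a candidate.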

Any ideas what went wrong to end up in such a disaster?

Best, Matthias

Shinken75 commented 3 years ago

I have the same issue. Could you help me, please? My nodes change status to disconnected just after adding garb to the cluster.

Below are the logs for one node (I have a cluster with two nodes, and there is no issue with the network/firewall):

2021-08-19 15:25:32 0 [Note] WSREP: (b0966f52-a3d9, 'tcp://0.0.0.0:4567') connection established to 547622d1-ae8a tcp://172.30.162.136:4567
2021-08-19 15:25:32 0 [Note] WSREP: (b0966f52-a3d9, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2021-08-19 15:25:32 0 [Note] WSREP: EVS version downgrade 1 -> 0
2021-08-19 15:25:32 0 [Note] WSREP: declaring 547622d1-ae8a at tcp://172.30.162.136:4567 stable
2021-08-19 15:25:32 0 [Note] WSREP: PC protocol downgrade 1 -> 0
2021-08-19 15:25:32 0 [Note] WSREP: Node b0966f52-a3d9 state prim
2021-08-19 15:25:32 0 [Note] WSREP: view(view_id(PRIM,547622d1-ae8a,26) memb {
    547622d1-ae8a,0
    b0966f52-a3d9,0
} joined {
} left {
} partitioned {
})
2021-08-19 15:25:32 0 [Note] WSREP: save pc into disk
2021-08-19 15:25:32 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
2021-08-19 15:25:32 0 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2021-08-19 15:25:33 0 [Note] WSREP: STATE EXCHANGE: sent state msg: ee31c5d7-00f0-11ec-bffa-1ad9577bb578
2021-08-19 15:25:33 0 [Note] WSREP: STATE EXCHANGE: got state msg: ee31c5d7-00f0-11ec-bffa-1ad9577bb578 from 0 (garb)
2021-08-19 15:25:33 0 [Note] WSREP: STATE EXCHANGE: got state msg: ee31c5d7-00f0-11ec-bffa-1ad9577bb578 from 1 (NODE01)
2021-08-19 15:25:33 0 [Note] WSREP: Quorum results:
    version    = 3,
    component  = PRIMARY,
    conf_id    = 1,
    members    = 1/2 (joined/total),
    act_id     = 51736,
    last_appl. = 51710,
    protocols  = 0/10/4 (gcs/repl/appl),
    vote policy= 1,
    group UUID = e5f5e660-ff67-11eb-9128-83f239f41407
2021-08-19 15:25:33 0 [Note] WSREP: Flow-control interval: [23, 23]
2021-08-19 15:25:33 2 [Note] WSREP: ####### processing CC 51736, local, ordered
2021-08-19 15:25:33 2 [Note] WSREP: ####### My UUID: b0966f52-00f0-11ec-a3d9-7f2166d5b22c
2021-08-19 15:25:33 2 [Note] WSREP: Skipping cert index reset
2021-08-19 15:25:33 2 [Note] WSREP: REPL Protocols: 10 (5)
2021-08-19 15:25:33 2 [Note] WSREP: ####### Adjusting cert position: 51736 -> 51736
2021-08-19 15:25:33 0 [Note] WSREP: Service thread queue flushed.
2021-08-19 15:25:33 0 [Note] WSREP: Member 0.0 (garb) requested state transfer from '*any*'. Selected 1.0 (NODE01)(SYNCED) as donor.
2021-08-19 15:25:33 0 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 51736)
2021-08-19 15:25:33 0 [Note] WSREP: 0.0 (garb): State transfer from 1.0 (NODE01) complete.
2021-08-19 15:25:33 2 [ERROR] WSREP: Failed to process action CONFIGURATION, g: 1, l: 28, ptr: 0x7fa4cfd79a78, size: 192: Attempt to reuse the same seqno: 51736. New buffer: addr: 0x7fa4cfd79a60, seqno: 0, size: 216, ctx: 0x563e61a68360, flags: 0. store: 1, type: 0, previous buffer: addr: 0x7fa4cfd798e8, seqno: 51736, size: 376, ctx: 0x563e61a68360, flags: 0. store: 1, type: 0 (FATAL)
    at /home/buildbot/buildbot/build/gcache/src/GCache_seqno.cpp:seqno_assign():86
2021-08-19 15:25:33 0 [Note] WSREP: Member 0.0 (garb) synced with group.
2021-08-19 15:25:33 2 [Note] WSREP: Closing send monitor...
2021-08-19 15:25:33 2 [Note] WSREP: Closed send monitor.
2021-08-19 15:25:33 2 [Note] WSREP: gcomm: terminating thread
2021-08-19 15:25:33 2 [Note] WSREP: gcomm: joining thread
2021-08-19 15:25:33 2 [Note] WSREP: gcomm: closing backend
2021-08-19 15:25:34 2 [Note] WSREP: view(view_id(NON_PRIM,547622d1-ae8a,26) memb {
    b0966f52-a3d9,0
} joined {
} left {
} partitioned {
    547622d1-ae8a,0
})
2021-08-19 15:25:34 2 [Note] WSREP: view((empty))
2021-08-19 15:25:34 2 [Note] WSREP: gcomm: closed
2021-08-19 15:25:34 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2021-08-19 15:25:34 0 [Note] WSREP: Flow-control interval: [16, 16]
2021-08-19 15:25:34 0 [Note] WSREP: Received NON-PRIMARY.
2021-08-19 15:25:34 0 [Note] WSREP: Shifting DONOR/DESYNCED -> OPEN (TO: 51736)
2021-08-19 15:25:34 0 [Note] WSREP: New SELF-LEAVE.
2021-08-19 15:25:34 0 [Note] WSREP: Flow-control interval: [0, 0]
2021-08-19 15:25:34 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2021-08-19 15:25:34 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 51736)
2021-08-19 15:25:34 0 [Note] WSREP: RECV thread exiting 0: Success
2021-08-19 15:25:34 2 [Note] WSREP: recv_thread() joined.
2021-08-19 15:25:34 2 [Note] WSREP: Closing replication queue.
2021-08-19 15:25:34 2 [Note] WSREP: Closing slave action queue.
2021-08-19 15:25:34 2 [Note] WSREP: ================================================
View:
  id: 00000000-0000-0000-0000-000000000000:-1
  status: non-primary
  protocol_version: -1
  capabilities:
  final: yes
  own_index: -1
  members(0):
2021-08-19 15:25:34 2 [Note] WSREP: Non-primary view
2021-08-19 15:25:34 2 [Note] WSREP: Server status change synced -> disconnected
2021-08-19 15:25:34 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2021-08-19 15:25:34 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2021-08-19 15:25:34 0 [Note] WSREP: Service thread queue flushed.
2021-08-19 15:25:34 2 [Note] WSREP: ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:-1, protocol version: 5
2021-08-19 15:25:34 2 [Note] WSREP: Applier thread exiting ret: 8 thd: 2

Best