adfinis / openshift-mariadb-galera

Kubernetes / OpenShift Images for a MariaDB Galera Cluster
GNU General Public License v3.0
37 stars 42 forks source link

Failed wsrep_sst_xtrabackup-v2 on Openshift 3.7, Kubernetes 1.7 #17

Closed ThoTischner closed 6 years ago

ThoTischner commented 6 years ago

Hi,

the first container gets ready but the second don`t.

Logs of the second container (mysql-1):

2018-03-22 10:04:13 140329350465792 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
2018-03-22 10:04:13 140329350465792 [Note] WSREP: restore pc from disk failed
2018-03-22 10:04:13 140329350465792 [Note] WSREP: GMCast version 0
2018-03-22 10:04:13 140329350465792 [Note] WSREP: (5f4ff20f, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2018-03-22 10:04:13 140329350465792 [Note] WSREP: (5f4ff20f, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2018-03-22 10:04:13 140329350465792 [Note] WSREP: EVS version 0
2018-03-22 10:04:13 140329350465792 [Note] WSREP: gcomm: connecting to group 'galera', peer 'mysql-0.galera.galera.svc.cluster.local:,mysql-1.galera.galera.svc.cluster.local:'
2018-03-22 10:04:13 140329350465792 [Note] WSREP: (5f4ff20f, 'tcp://0.0.0.0:4567') connection established to 5f4ff20f tcp://10.131.0.110:4567
2018-03-22 10:04:13 140329350465792 [Warning] WSREP: (5f4ff20f, 'tcp://0.0.0.0:4567') address 'tcp://10.131.0.110:4567' points to own listening address, blacklisting
2018-03-22 10:04:13 140329350465792 [Note] WSREP: (5f4ff20f, 'tcp://0.0.0.0:4567') connection established to 49b580d4 tcp://10.130.1.196:4567
2018-03-22 10:04:13 140329350465792 [Note] WSREP: (5f4ff20f, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: 
2018-03-22 10:04:13 140329350465792 [Note] WSREP: declaring 49b580d4 at tcp://10.130.1.196:4567 stable
2018-03-22 10:04:13 140329350465792 [Note] WSREP: Node 49b580d4 state prim
2018-03-22 10:04:13 140329350465792 [Note] WSREP: view(view_id(PRIM,49b580d4,2) memb {
    49b580d4,0
    5f4ff20f,0
} joined {
} left {
} partitioned {
})
2018-03-22 10:04:13 140329350465792 [Note] WSREP: save pc into disk
2018-03-22 10:04:14 140329350465792 [Note] WSREP: gcomm: connected
2018-03-22 10:04:14 140329350465792 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2018-03-22 10:04:14 140329350465792 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2018-03-22 10:04:14 140329350465792 [Note] WSREP: Opened channel 'galera'
2018-03-22 10:04:14 140329026053888 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
2018-03-22 10:04:14 140329350465792 [Note] WSREP: Waiting for SST to complete.
2018-03-22 10:04:14 140329026053888 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2018-03-22 10:04:14 140329026053888 [Note] WSREP: STATE EXCHANGE: sent state msg: 5f9d3238-2db8-11e8-a838-6a2b63375367
2018-03-22 10:04:14 140329026053888 [Note] WSREP: STATE EXCHANGE: got state msg: 5f9d3238-2db8-11e8-a838-6a2b63375367 from 0 (mysql-0)
2018-03-22 10:04:14 140329026053888 [Note] WSREP: STATE EXCHANGE: got state msg: 5f9d3238-2db8-11e8-a838-6a2b63375367 from 1 (mysql-1)
2018-03-22 10:04:14 140329026053888 [Note] WSREP: Quorum results:
    version    = 4,
    component  = PRIMARY,
    conf_id    = 1,
    members    = 1/2 (joined/total),
    act_id     = 0,
    last_appl. = -1,
    protocols  = 0/7/3 (gcs/repl/appl),
    group UUID = 49b5ca35-2db8-11e8-93dd-c3424666cb76
2018-03-22 10:04:14 140329026053888 [Note] WSREP: Flow-control interval: [23, 23]
2018-03-22 10:04:14 140329026053888 [Note] WSREP: Trying to continue unpaused monitor
2018-03-22 10:04:14 140329026053888 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 0)
2018-03-22 10:04:14 140329350146816 [Note] WSREP: State transfer required: 
    Group state: 49b5ca35-2db8-11e8-93dd-c3424666cb76:0
    Local state: 00000000-0000-0000-0000-000000000000:-1
2018-03-22 10:04:14 140329350146816 [Note] WSREP: New cluster view: global state: 49b5ca35-2db8-11e8-93dd-c3424666cb76:0, view# 2: Primary, number of nodes: 2, my index: 1, protocol version 3
2018-03-22 10:04:14 140329350146816 [Warning] WSREP: Gap in state sequence. Need state transfer.
2018-03-22 10:04:14 140328996697856 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address 'mysql-1.galera.galera.svc.cluster.local' --datadir '/var/lib/mysql/'   --parent '1'  '' '
WSREP_SST: [INFO] Streaming with xbstream (20180322 10:04:14.328)
WSREP_SST: [INFO] Using socat as streamer (20180322 10:04:14.330)
WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20180322 10:04:14.359)
2018-03-22 10:04:16 140329034446592 [Note] WSREP: (5f4ff20f, 'tcp://0.0.0.0:4567') connection to peer 5f4ff20f with addr tcp://10.131.0.110:4567 timed out, no messages seen in PT3S
2018-03-22 10:04:16 140329034446592 [Note] WSREP: (5f4ff20f, 'tcp://0.0.0.0:4567') turning message relay requesting off
WSREP_SST: [ERROR] Possible timeout in receving first data from donor in gtid stage (20180322 10:05:54.364)
WSREP_SST: [ERROR] Cleanup after exit with status:32 (20180322 10:05:54.366)
2018-03-22 10:05:54 140328996697856 [ERROR] WSREP: Failed to read 'ready <addr>' from: wsrep_sst_xtrabackup-v2 --role 'joiner' --address 'mysql-1.galera.galera.svc.cluster.local' --datadir '/var/lib/mysql/'   --parent '1'  '' 
    Read: '(null)'
2018-03-22 10:05:54 140328996697856 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address 'mysql-1.galera.galera.svc.cluster.local' --datadir '/var/lib/mysql/'   --parent '1'  '' : 32 (Broken pipe)
2018-03-22 10:05:54 140329350146816 [ERROR] WSREP: Failed to prepare for 'xtrabackup-v2' SST. Unrecoverable.
2018-03-22 10:05:54 140329350146816 [ERROR] Aborting

Error in my_thread_global_end(): 1 threads didn't exit

Logs of the first container (mysql-0):

2018-03-22 10:03:36 139761551272192 [Note] WSREP: save pc into disk
--
  | 2018-03-22 10:03:36 139761551272192 [Note] WSREP: gcomm: connected
  | 2018-03-22 10:03:36 139761551272192 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
  | 2018-03-22 10:03:36 139761551272192 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
  | 2018-03-22 10:03:36 139761551272192 [Note] WSREP: Opened channel 'galera'
  | 2018-03-22 10:03:36 139761551272192 [Note] WSREP: Waiting for SST to complete.
  | 2018-03-22 10:03:36 139761226344192 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
  | 2018-03-22 10:03:36 139761226344192 [Note] WSREP: Starting new group from scratch: 49b5ca35-2db8-11e8-93dd-c3424666cb76
  | 2018-03-22 10:03:36 139761226344192 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 49b5cbec-2db8-11e8-b4a7-7a948ae5dbe5
  | 2018-03-22 10:03:36 139761226344192 [Note] WSREP: STATE EXCHANGE: sent state msg: 49b5cbec-2db8-11e8-b4a7-7a948ae5dbe5
  | 2018-03-22 10:03:36 139761226344192 [Note] WSREP: STATE EXCHANGE: got state msg: 49b5cbec-2db8-11e8-b4a7-7a948ae5dbe5 from 0 (mysql-0)
  | 2018-03-22 10:03:36 139761226344192 [Note] WSREP: Quorum results:
  | version    = 4,
  | component  = PRIMARY,
  | conf_id    = 0,
  | members    = 1/1 (joined/total),
  | act_id     = 0,
  | last_appl. = -1,
  | protocols  = 0/7/3 (gcs/repl/appl),
  | group UUID = 49b5ca35-2db8-11e8-93dd-c3424666cb76
  | 2018-03-22 10:03:36 139761226344192 [Note] WSREP: Flow-control interval: [16, 16]
  | 2018-03-22 10:03:36 139761226344192 [Note] WSREP: Trying to continue unpaused monitor
  | 2018-03-22 10:03:36 139761226344192 [Note] WSREP: Restored state OPEN -> JOINED (0)
  | 2018-03-22 10:03:36 139761226344192 [Note] WSREP: Member 0.0 (mysql-0) synced with group.
  | 2018-03-22 10:03:36 139761226344192 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 0)
  | 2018-03-22 10:03:36 139761550953216 [Note] WSREP: New cluster view: global state: 49b5ca35-2db8-11e8-93dd-c3424666cb76:0, view# 1: Primary, number of nodes: 1, my index: 0, protocol version 3
  | 2018-03-22 10:03:36 139761551272192 [Note] WSREP: SST complete, seqno: 0
  | 2018-03-22 10:03:36 139761551272192 [Note] InnoDB: innodb_empty_free_list_algorithm has been changed to legacy because of small buffer pool size. In order to use backoff, increase buffer pool at least up to 20MB.
  |  
  | 2018-03-22 10:03:36 139761551272192 [Note] InnoDB: Using mutexes to ref count buffer pool pages
  | 2018-03-22 10:03:36 139761551272192 [Note] InnoDB: The InnoDB memory heap is disabled
  | 2018-03-22 10:03:36 139761551272192 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
  | 2018-03-22 10:03:36 139761551272192 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
  | 2018-03-22 10:03:36 139761551272192 [Note] InnoDB: Compressed tables use zlib 1.2.7
  | 2018-03-22 10:03:36 139761551272192 [Note] InnoDB: Using Linux native AIO
  | 2018-03-22 10:03:36 139761551272192 [Note] InnoDB: Using generic crc32 instructions
  | 2018-03-22 10:03:37 139761551272192 [Note] InnoDB: Initializing buffer pool, size = 128.0M
  | 2018-03-22 10:03:37 139761551272192 [Note] InnoDB: Completed initialization of buffer pool
  | 2018-03-22 10:03:37 139761551272192 [Note] InnoDB: Highest supported file format is Barracuda.
  | 2018-03-22 10:03:37 139761551272192 [Note] InnoDB: 128 rollback segment(s) are active.
  | 2018-03-22 10:03:37 139761551272192 [Note] InnoDB: Waiting for purge to start
  | 2018-03-22 10:03:37 139761551272192 [Note] InnoDB:  Percona XtraDB (http://www.percona.com ) 5.6.36-83.0 started; log sequence number 1616727
  | 2018-03-22 10:03:37 139760517510912 [Note] InnoDB: Dumping buffer pool(s) not yet started
  | 2018-03-22 10:03:37 139761551272192 [Note] Plugin 'FEEDBACK' is disabled.
  | 2018-03-22 10:03:37 139761551272192 [Note] Server socket created on IP: '0.0.0.0'.
  | 2018-03-22 10:03:37 139761551272192 [Note] mysqld: ready for connections.
  | Version: '10.1.31-MariaDB'  socket: '/var/run/mysql/mysql.sock'  port: 3306  MariaDB Server
  | 2018-03-22 10:03:37 139761550953216 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
  | 2018-03-22 10:03:37 139761550953216 [Note] WSREP: REPL Protocols: 7 (3, 2)
  | 2018-03-22 10:03:37 139761550953216 [Note] WSREP: Assign initial position for certification: 0, protocol version: 3
  | 2018-03-22 10:03:37 139761284654848 [Note] WSREP: Service thread queue flushed.
  | 2018-03-22 10:03:37 139761550953216 [Note] WSREP: GCache history reset: 00000000-0000-0000-0000-000000000000:0 -> 49b5ca35-2db8-11e8-93dd-c3424666cb76:0
  | 2018-03-22 10:03:37 139761550953216 [Note] WSREP: Synchronized with group, ready for connections
  | 2018-03-22 10:03:37 139761550953216 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
  | 2018-03-22 10:04:13 139761234736896 [Note] WSREP: (49b580d4, 'tcp://0.0.0.0:4567') connection established to 5f4ff20f tcp://10.131.0.110:4567
  | 2018-03-22 10:04:13 139761234736896 [Note] WSREP: (49b580d4, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
  | 2018-03-22 10:04:13 139761234736896 [Note] WSREP: declaring 5f4ff20f at tcp://10.131.0.110:4567 stable
  | 2018-03-22 10:04:13 139761234736896 [Note] WSREP: Node 49b580d4 state prim
  | 2018-03-22 10:04:13 139761234736896 [Note] WSREP: view(view_id(PRIM,49b580d4,2) memb {
  | 49b580d4,0
  | 5f4ff20f,0
  | } joined {
  | } left {
  | } partitioned {
  | })
  | 2018-03-22 10:04:13 139761234736896 [Note] WSREP: save pc into disk
  | 2018-03-22 10:04:13 139761226344192 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
  | 2018-03-22 10:04:13 139761226344192 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 5f9d3238-2db8-11e8-a838-6a2b63375367
  | 2018-03-22 10:04:13 139761226344192 [Note] WSREP: STATE EXCHANGE: sent state msg: 5f9d3238-2db8-11e8-a838-6a2b63375367
  | 2018-03-22 10:04:13 139761226344192 [Note] WSREP: STATE EXCHANGE: got state msg: 5f9d3238-2db8-11e8-a838-6a2b63375367 from 0 (mysql-0)
  | 2018-03-22 10:04:14 139761226344192 [Note] WSREP: STATE EXCHANGE: got state msg: 5f9d3238-2db8-11e8-a838-6a2b63375367 from 1 (mysql-1)
  | 2018-03-22 10:04:14 139761226344192 [Note] WSREP: Quorum results:
  | version    = 4,
  | component  = PRIMARY,
  | conf_id    = 1,
  | members    = 1/2 (joined/total),
  | act_id     = 0,
  | last_appl. = 0,
  | protocols  = 0/7/3 (gcs/repl/appl),
  | group UUID = 49b5ca35-2db8-11e8-93dd-c3424666cb76
  | 2018-03-22 10:04:14 139761226344192 [Note] WSREP: Flow-control interval: [23, 23]
  | 2018-03-22 10:04:14 139761226344192 [Note] WSREP: Trying to continue unpaused monitor
  | 2018-03-22 10:04:14 139761550953216 [Note] WSREP: New cluster view: global state: 49b5ca35-2db8-11e8-93dd-c3424666cb76:0, view# 2: Primary, number of nodes: 2, my index: 0, protocol version 3
  | 2018-03-22 10:04:14 139761550953216 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
  | 2018-03-22 10:04:14 139761550953216 [Note] WSREP: REPL Protocols: 7 (3, 2)
  | 2018-03-22 10:04:14 139761550953216 [Note] WSREP: Assign initial position for certification: 0, protocol version: 3
  | 2018-03-22 10:04:14 139761284654848 [Note] WSREP: Service thread queue flushed.
  | 2018-03-22 10:04:16 139761234736896 [Note] WSREP: (49b580d4, 'tcp://0.0.0.0:4567') turning message relay requesting off
  | 2018-03-22 10:05:57 139761234736896 [Note] WSREP: forgetting 5f4ff20f (tcp://10.131.0.110:4567)
  | 2018-03-22 10:05:57 139761234736896 [Note] WSREP: Node 49b580d4 state prim
  | 2018-03-22 10:05:57 139761234736896 [Note] WSREP: view(view_id(PRIM,49b580d4,3) memb {
  | 49b580d4,0
  | } joined {
  | } left {
  | } partitioned {
  | 5f4ff20f,0
  | })
  | 2018-03-22 10:05:57 139761234736896 [Note] WSREP: save pc into disk
  | 2018-03-22 10:05:57 139761234736896 [Note] WSREP: forgetting 5f4ff20f (tcp://10.131.0.110:4567)
  | 2018-03-22 10:05:57 139761226344192 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
  | 2018-03-22 10:05:57 139761226344192 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 9d7549e4-2db8-11e8-a213-ee2e0ed4bdf3
  | 2018-03-22 10:05:57 139761226344192 [Note] WSREP: STATE EXCHANGE: sent state msg: 9d7549e4-2db8-11e8-a213-ee2e0ed4bdf3
  | 2018-03-22 10:05:57 139761226344192 [Note] WSREP: STATE EXCHANGE: got state msg: 9d7549e4-2db8-11e8-a213-ee2e0ed4bdf3 from 0 (mysql-0)
  | 2018-03-22 10:05:57 139761226344192 [Note] WSREP: Quorum results:
  | version    = 4,
  | component  = PRIMARY,
  | conf_id    = 2,
  | members    = 1/1 (joined/total),
  | act_id     = 0,
  | last_appl. = 0,
  | protocols  = 0/7/3 (gcs/repl/appl),
  | group UUID = 49b5ca35-2db8-11e8-93dd-c3424666cb76
  | 2018-03-22 10:05:57 139761226344192 [Note] WSREP: Flow-control interval: [16, 16]
  | 2018-03-22 10:05:57 139761226344192 [Note] WSREP: Trying to continue unpaused monitor
  | 2018-03-22 10:05:57 139761550953216 [Note] WSREP: New cluster view: global state: 49b5ca35-2db8-11e8-93dd-c3424666cb76:0, view# 3: Primary, number of nodes: 1, my index: 0, protocol version 3
  | 2018-03-22 10:05:57 139761550953216 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
  | 2018-03-22 10:05:57 139761550953216 [Note] WSREP: REPL Protocols: 7 (3, 2)
  | 2018-03-22 10:05:57 139761550953216 [Note] WSREP: Assign initial position for certification: 0, protocol version: 3
  | 2018-03-22 10:05:57 139761284654848 [Note] WSREP: Service thread queue flushed.
  | 2018-03-22 10:06:01 139761234736896 [Note] WSREP: (49b580d4, 'tcp://0.0.0.0:4567') connection established to 5f4ff20f tcp://10.131.0.110:4567
  | 2018-03-22 10:06:01 139761234736896 [Warning] WSREP: discarding established (time wait) 5f4ff20f (tcp://10.131.0.110:4567)
  | 2018-03-22 10:06:02 139761234736896 [Note] WSREP:  cleaning up 5f4ff20f (tcp://10.131.0.110:4567)
  | 2018-03-22 10:06:27 139761234736896 [Note] WSREP: (49b580d4, 'tcp://0.0.0.0:4567') connection established to af9c3439 tcp://10.131.0.110:4567
  | 2018-03-22 10:06:27 139761234736896 [Note] WSREP: (49b580d4, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
  | 2018-03-22 10:06:28 139761234736896 [Note] WSREP: declaring af9c3439 at tcp://10.131.0.110:4567 stable
  | 2018-03-22 10:06:28 139761234736896 [Note] WSREP: Node 49b580d4 state prim
  | 2018-03-22 10:06:28 139761234736896 [Note] WSREP: view(view_id(PRIM,49b580d4,4) memb {
  | 49b580d4,0
  | af9c3439,0
  | } joined {
  | } left {
  | } partitioned {
  | })
  | 2018-03-22 10:06:28 139761234736896 [Note] WSREP: save pc into disk
  | 2018-03-22 10:06:28 139761226344192 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
  | 2018-03-22 10:06:28 139761226344192 [Note] WSREP: STATE_EXCHANGE: sent state UUID: afe96f99-2db8-11e8-a120-7fc34bc2f1d7
  | 2018-03-22 10:06:28 139761226344192 [Note] WSREP: STATE EXCHANGE: sent state msg: afe96f99-2db8-11e8-a120-7fc34bc2f1d7
  | 2018-03-22 10:06:28 139761226344192 [Note] WSREP: STATE EXCHANGE: got state msg: afe96f99-2db8-11e8-a120-7fc34bc2f1d7 from 0 (mysql-0)
  | 2018-03-22 10:06:28 139761226344192 [Note] WSREP: STATE EXCHANGE: got state msg: afe96f99-2db8-11e8-a120-7fc34bc2f1d7 from 1 (mysql-1)
  | 2018-03-22 10:06:28 139761226344192 [Note] WSREP: Quorum results:
  | version    = 4,
  | component  = PRIMARY,
  | conf_id    = 3,
  | members    = 1/2 (joined/total),
  | act_id     = 0,
  | last_appl. = 0,
  | protocols  = 0/7/3 (gcs/repl/appl),
  | group UUID = 49b5ca35-2db8-11e8-93dd-c3424666cb76
  | 2018-03-22 10:06:28 139761226344192 [Note] WSREP: Flow-control interval: [23, 23]

Thanks, Thomas

tongpu commented 6 years ago

This error isn't caused by Kubernetes or OpenShift but stems from an upstream error in MariaDB: http://lists.askmonty.org/pipermail/commits/2018-February/012037.html

In both MariaDB 10.1.31 and 10.2.13 the script installed at /usr/bin/wsrep_sst_xtrabackup-v2 is broken and needs to be fixed. I'm going to discuss how we're going to fix this downstream, because it clearly is an upstream issue.

ThoTischner commented 6 years ago

Okay, is there any workaround at this time? Maybe i can use an older version of your docker image?

tongpu commented 6 years ago

I've implemented a workaround in 3f86a26e11c5b594a8d28236eb4c2b7eef216c1b by manually patching the wsrep_sst_xtrabackup-v2 script. I've pushed the updated image as v004 again, so you'll only need to pull the image again.

ThoTischner commented 6 years ago

Hey thanks i will try =)

ThoTischner commented 6 years ago

It`s working now thanks alot.