colinmollenhour / mariadb-galera-swarm

MariaDb Galera Cluster container based on official mariadb image which can auto-bootstrap and recover cluster state.
https://hub.docker.com/r/colinmollenhour/mariadb-galera-swarm
Apache License 2.0

Cannot create a stable cluster #33

Closed: ariselseng closed this issue 6 years ago

ariselseng commented 6 years ago

I am using docker swarm. I did exactly as in the docker swarm example, except removing the NODE_ADDRESS variable to let it autoconfigure. I also feel like I have tried everything, including creating the network first, inspecting it to get the IP range, and using that range for the NODE_ADDRESS variable. Does this project work in its current state?
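For reference, a Swarm deployment along the lines the report describes can be sketched as a stack file. This is a hedged reconstruction, not the repo's verbatim example: the service and network names, the `command` arguments, and the password variables are assumptions; only NODE_ADDRESS and the use of the overlay network's subnet come from this thread.

```yaml
# Hedged sketch only -- names, commands and variables other than
# NODE_ADDRESS are illustrative assumptions, not the repo's example.
version: "3"
networks:
  galera:
    driver: overlay
services:
  seed:
    image: colinmollenhour/mariadb-galera-swarm
    networks: [galera]
    environment:
      MYSQL_ROOT_PASSWORD: changeme   # assumption: standard mariadb variable
      XTRABACKUP_PASSWORD: changeme   # assumption: SST credentials
      NODE_ADDRESS: 10.0.4.0/24       # overlay subnet, as tried in the report
    command: seed
  node:
    image: colinmollenhour/mariadb-galera-swarm
    networks: [galera]
    environment:
      MYSQL_ROOT_PASSWORD: changeme
      XTRABACKUP_PASSWORD: changeme
      NODE_ADDRESS: 10.0.4.0/24
    command: node tasks.seed,tasks.node   # assumption: comma-separated peers
    deploy:
      replicas: 2
```

Once the replicas have synced, the seed service is typically removed and only the node replicas are kept (again an assumption, based on the image's seed/node startup modes).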


galera_node.2.hfps0ofumgos@swarm-master-03    | 2018-02-27  9:38:23 140592977786624 [Note] WSREP: save pc into disk
galera_node.2.hfps0ofumgos@swarm-master-03    | 2018-02-27  9:38:23 140592969393920 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
galera_node.2.hfps0ofumgos@swarm-master-03    | 2018-02-27  9:38:23 140592969393920 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
galera_node.2.hfps0ofumgos@swarm-master-03    | 2018-02-27  9:38:23 140592977786624 [Note] WSREP: forgetting b663bab1 (tcp://10.0.4.17:4567)
galera_node.2.hfps0ofumgos@swarm-master-03    | 2018-02-27  9:38:23 140592969393920 [Note] WSREP: STATE EXCHANGE: sent state msg: f3f2f61c-1ba1-11e8-a872-5f4a2014eb6d
galera_node.2.hfps0ofumgos@swarm-master-03    | 2018-02-27  9:38:23 140592969393920 [Note] WSREP: STATE EXCHANGE: got state msg: f3f2f61c-1ba1-11e8-a872-5f4a2014eb6d from 0 (5dcd2316d4f6)
galera_node.2.hfps0ofumgos@swarm-master-03    | 2018-02-27  9:38:23 140592969393920 [Note] WSREP: STATE EXCHANGE: got state msg: f3f2f61c-1ba1-11e8-a872-5f4a2014eb6d from 1 (f6c9099d7e52)
galera_node.2.hfps0ofumgos@swarm-master-03    | 2018-02-27  9:38:23 140592969393920 [Note] WSREP: Quorum results:
galera_node.2.hfps0ofumgos@swarm-master-03    |         version    = 4,
galera_node.2.hfps0ofumgos@swarm-master-03    |         component  = PRIMARY,
galera_node.2.hfps0ofumgos@swarm-master-03    |         conf_id    = 11,
galera_node.2.hfps0ofumgos@swarm-master-03    |         members    = 1/2 (joined/total),
galera_node.2.hfps0ofumgos@swarm-master-03    |         act_id     = 0,
galera_node.2.hfps0ofumgos@swarm-master-03    |         last_appl. = 0,
galera_node.2.hfps0ofumgos@swarm-master-03    |         protocols  = 0/8/3 (gcs/repl/appl),
galera_node.2.hfps0ofumgos@swarm-master-03    |         group UUID = a156782c-1ba0-11e8-9506-f3d6550b8b55
galera_node.2.hfps0ofumgos@swarm-master-03    | 2018-02-27  9:38:23 140592969393920 [Note] WSREP: Flow-control interval: [23, 23]
galera_node.2.hfps0ofumgos@swarm-master-03    | 2018-02-27  9:38:23 140592969393920 [Note] WSREP: Trying to continue unpaused monitor
   joiner: => Rate:[   0 B/s] Avg:[   0 B/s] Elapsed:0:01:40  Bytes:    0 B
galera_node.2.hfps0ofumgos@swarm-master-03    | WSREP_SST: [ERROR] Possible timeout in receving first data from donor in gtid stage (20180227 09:38:28.182)
galera_node.2.hfps0ofumgos@swarm-master-03    | WSREP_SST: [ERROR] Cleanup after exit with status:32 (20180227 09:38:28.186)
galera_node.2.hfps0ofumgos@swarm-master-03    | WSREP_SST: [INFO] Cleaning up fifo file /tmp/mysql-console/fifo (20180227 09:38:28.191)
galera_node.2.hfps0ofumgos@swarm-master-03    | rm: cannot remove '/tmp/mysql-console/fifo': Permission denied
galera_node.2.hfps0ofumgos@swarm-master-03    | 2018-02-27  9:38:28 140592961001216 [ERROR] WSREP: Failed to read 'ready <addr>' from: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.0.4.16' --datadir '/var/lib/mysql/'   --parent '41'  '' 
galera_node.2.hfps0ofumgos@swarm-master-03    |         Read: '(null)'
galera_node.2.hfps0ofumgos@swarm-master-03    | 2018-02-27  9:38:28 140592961001216 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.0.4.16' --datadir '/var/lib/mysql/'   --parent '41'  '' : 1 (Operation not permitted)
galera_node.2.hfps0ofumgos@swarm-master-03    | 2018-02-27  9:38:28 140595377510144 [ERROR] WSREP: Failed to prepare for 'xtrabackup-v2' SST. Unrecoverable.
galera_node.2.hfps0ofumgos@swarm-master-03    | 2018-02-27  9:38:28 140595377510144 [ERROR] Aborting
galera_node.2.hfps0ofumgos@swarm-master-03    | 
galera_node.1.h1m8gi66g9va@node-02    | Error in my_thread_global_end(): 1 threads didn't exit
galera_node.1.h1m8gi66g9va@node-02    | MariaDB exited with return code (0)
galera_node.1.h1m8gi66g9va@node-02    | # GALERA saved state
galera_node.1.h1m8gi66g9va@node-02    | version: 2.1
galera_node.1.h1m8gi66g9va@node-02    | uuid:    00000000-0000-0000-0000-000000000000
galera_node.1.h1m8gi66g9va@node-02    | seqno:   -1
galera_node.1.h1m8gi66g9va@node-02    | safe_to_bootstrap: 1
galera_node.1.h1m8gi66g9va@node-02    | Goodbye
ariselseng commented 6 years ago

I forgot to mention that the nodes only live for 2 minutes. After that, new containers replace them.

smidge84 commented 6 years ago

I am having exactly the same issue. I also noticed that once this has occurred, the seed node starts failing its health check. Portainer reports curl: (22) The requested URL returned error: 503 Service Unavailable, which results in the seed container being cycled. I'm trying to read more about Galera and xtrabackup in general to understand this part of the process and why the comms hang. Some insight would be very much appreciated.

peter-slovak commented 6 years ago

@cowai @smidge84 You're probably hitting https://jira.mariadb.org/browse/MDEV-15383?workflowName=MariaDB+v3&stepId=1 and https://jira.mariadb.org/browse/MDEV-15254 . It seems more patches are coming, similar to the 10.1.31 release. My personal advice is to use the 10.1.31 Dockerfile - I've already spent a few hours trying to get 10.2 running, without success.

ariselseng commented 6 years ago

@peter-slovak Thank you for the tip. Just to confirm. Would the tag "10.1.31-2018-02-20" work okay?

colinmollenhour commented 6 years ago

Yes, I'm fairly confident that 10.1.31-2018-02-20 is stable as it applies some patches to fix the major regressions:

https://github.com/colinmollenhour/mariadb-galera-swarm/blob/master/Dockerfile-10.1#L22

I was able to bootstrap my Kontena cluster using this patched 10.1.31 version. I've been told 10.2 works fine but I have not tried it myself.

ariselseng commented 6 years ago

@colinmollenhour I tried that image, and it has been stable for days now. Thanks! You may want to remove the 10.2 image or warn that it doesn't work. Just a thought.

FalkNisius commented 6 years ago

In my case I found out that in the 10.2 MariaDB image, the tail command used to watch /tmp/mysql-console/fifo fails if the swarm node uses overlayfs. The tail from coreutils (8.23.4) in this image has a known bug where that file system type is unrecognized. A newer Debian ships coreutils 8.24.1, where this is fixed. Perhaps someone can confirm this with a test against aufs as the overlay fs.
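If this diagnosis is right, a quick sanity check is to compare the coreutils version inside the image against 8.24. The threshold is taken from the comment above; the script itself is a hedged sketch, meant to be run inside the container being checked.

```shell
#!/bin/sh
# Hedged sketch: warn if the local coreutils `tail` predates 8.24, the
# version the comment above says fixed overlayfs detection.
have="$(tail --version 2>/dev/null | head -n1 | grep -o '[0-9][0-9.]*$')"
if [ -z "$have" ]; then
    echo "could not determine tail/coreutils version"
elif [ "$(printf '%s\n%s\n' 8.24 "$have" | sort -V | head -n1)" = "8.24" ]; then
    echo "coreutils $have: tail should recognize overlayfs"
else
    echo "coreutils $have: tail may fail on overlayfs; consider a newer base image"
fi
```

With coreutils 8.23.4 (the version reported in this image) the script takes the warning branch, since `sort -V` orders 8.23.4 before 8.24.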

ariselseng commented 6 years ago

I use aufs and still had problems, so this is probably not related. Thanks for the heads-up, as I wanted to move to overlayfs very soon.