colinmollenhour / mariadb-galera-swarm

MariaDb Galera Cluster container based on official mariadb image which can auto-bootstrap and recover cluster state.
https://hub.docker.com/r/colinmollenhour/mariadb-galera-swarm
Apache License 2.0
219 stars 103 forks source link

Node shutdown after seed removal (continued) #65

Closed alphaDev23 closed 5 years ago

alphaDev23 commented 5 years ago

An issue related to the same was closed in November; however, this issue is not resolved. https://github.com/colinmollenhour/mariadb-galera-swarm/issues/51

I'm using 18.06.1-ce Below is the log from the node which went from a status of primary to non-primary after the seed was removed and then shutdown. NOTE: There are crossed out lines but that appears only because of the application's html formatting from which I pulled the log.

The appropriate sequence of events from the remaining nodes are different (it does not change its status from primary to non-primary) after a node is removed per: https://severalnines.com/blog/galera-cluster-recovery-101-deep-dive-network-partitioning

Thoughts on a resolution? At this point, I do not see how this code works with swarm once the seed is removed.

...------======------... MariaDB Galera Start Script ...------======------..., Got NODE_ADDRESS=10.0.23.29, Starting node, connecting to gcomm://10.0.23.5, Tailing /tmp/mysql-console/fifo..., ===|mysqld.sh|===: Recovered position from grastate.dat: a122b452-175e-11e9-8164-e62c82786503:2914, ===|mysqld.sh|===: Safe to bootstrap: 0, Galera Cluster Node status: synced, ===|mysqld.sh|===: Node at 10.0.23.5 is healthy!, ===|mysqld.sh|===: Found a healthy node! Attempting to join..., ===|mysqld.sh|===: ---------------------------------------------------------------, ===|mysqld.sh|===: Starting with options: --console --wsrep_sst_auth=xtrabackup:litmusblue --wsrep-on=ON --wsrep-sst-method=xtrabackup-v2 --wsrep_cluster_name=cluster --wsrep_cluster_address=gcomm://10.0.23.5 --wsrep_node_address=10.0.23.29:4567 --default-time-zone=+00:00 --wsrep_start_position=a122b452-175e-11e9-8164-e62c82786503:2914, 2019-01-13 20:15:52 140320666687424 [Note] mysqld (mysqld 10.2.18-MariaDB-1:10.2.18+maria~bionic) starting as process 46 ..., 2019-01-13 20:15:52 140320666687424 [Note] WSREP: Read nil XID from storage engines, skipping position init, 2019-01-13 20:15:52 140320666687424 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/galera/libgalera_smm.so', 2019-01-13 20:15:52 140320666687424 [Note] WSREP: wsrep_load(): Galera 25.3.24(r3825) by Codership Oy info@codership.com loaded successfully., 2019-01-13 20:15:52 140320666687424 [Note] WSREP: CRC-32C: using hardware acceleration., 2019-01-13 20:15:52 140320666687424 [Note] WSREP: Found saved state: a122b452-175e-11e9-8164-e62c82786503:2914, safe_to_bootstrap: 0, 2019-01-13 20:15:52 140320666687424 [Note] WSREP: Recovering GCache ring buffer: version: 2, UUID: a122b452-175e-11e9-8164-e62c82786503, offset: -1, 2019-01-13 20:15:52 140320666687424 [Note] WSREP: GCache::RingBuffer initial scan... 0.0% ( 0/2147483688 bytes) complete., 2019-01-13 20:15:56 140320666687424 [Note] WSREP: GCache::RingBuffer initial scan...100.0% (2147483688/2147483688 bytes) complete., 2019-01-13 20:15:56 140320666687424 [Note] WSREP: Recovering GCache ring buffer: found gapless sequence 2914-2914, 2019-01-13 20:15:56 140320666687424 [Note] WSREP: GCache::RingBuffer unused buffers scan... 0.0% ( 0/2824 bytes) complete., 2019-01-13 20:15:56 140320666687424 [Note] WSREP: GCache::RingBuffer unused buffers scan...100.0% (2824/2824 bytes) complete., 2019-01-13 20:15:56 140320666687424 [Note] WSREP: GCache DEBUG: RingBuffer::recover(): found 0/1 locked buffers, 2019-01-13 20:15:56 140320666687424 [Note] WSREP: GCache DEBUG: RingBuffer::recover(): used space: 2824/2147483648, 2019-01-13 20:15:56 140320666687424 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 10.0.23.29; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 1024M; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = yes; gcache.size = 2048M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; , 2019-01-13 20:15:56 140320666687424 [Note] WSREP: Assign initial position for certification: 2914, protocol version: -1, 2019-01-13 20:15:56 140320666687424 [Note] WSREP: wsrep_sst_grab(), 2019-01-13 20:15:56 140320666687424 [Note] WSREP: Start replication, 2019-01-13 20:15:56 140320666687424 [Note] WSREP: Setting initial position to a122b452-175e-11e9-8164-e62c82786503:2914, 2019-01-13 20:15:56 140320666687424 [Note] WSREP: protonet asio version 0, 2019-01-13 20:15:56 140320666687424 [Note] WSREP: Using CRC-32C for message checksums., 2019-01-13 20:15:56 140320666687424 [Note] WSREP: backend: asio, 2019-01-13 20:15:56 140320666687424 [Note] WSREP: gcomm thread scheduling priority set to other:0 , 2019-01-13 20:15:56 140320666687424 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory), 2019-01-13 20:15:56 140320666687424 [Note] WSREP: restore pc from disk failed, 2019-01-13 20:15:56 140320666687424 [Note] WSREP: GMCast version 0, 2019-01-13 20:15:56 140320666687424 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567, 2019-01-13 20:15:56 140320666687424 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') multicast: , ttl: 1, 2019-01-13 20:15:56 140320666687424 [Note] WSREP: EVS version 0, 2019-01-13 20:15:56 140320666687424 [Note] WSREP: gcomm: connecting to group 'cluster', peer '10.0.23.5:', 2019-01-13 20:15:56 140320666687424 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') connection established to 6a1450ce tcp://10.0.23.5:4567, 2019-01-13 20:15:56 140320666687424 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.0.23.3:4567 , 2019-01-13 20:15:56 140320666687424 [Note] WSREP: declaring 6a1450ce at tcp://10.0.23.5:4567 stable, 2019-01-13 20:15:56 140320666687424 [Note] WSREP: declaring 91bf1a62 at tcp://10.0.23.3:4567 stable, 2019-01-13 20:15:56 140320666687424 [Note] WSREP: Node 6a1450ce state prim, 2019-01-13 20:15:56 140320666687424 [Note] WSREP: view(view_id(PRIM,08a8d172,6) memb {, 08a8d172,0, 6a1450ce,0, 91bf1a62,0, } joined {, } left {, } partitioned {, }), 2019-01-13 20:15:56 140320666687424 [Note] WSREP: save pc into disk, 2019-01-13 20:15:56 140320666687424 [Note] WSREP: gcomm: connected, 2019-01-13 20:15:56 140320666687424 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636, 2019-01-13 20:15:56 140320666687424 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0), 2019-01-13 20:15:56 140320666687424 [Note] WSREP: Opened channel 'cluster', 2019-01-13 20:15:56 140318259275520 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3, 2019-01-13 20:15:56 140318259275520 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 08f5da22-1770-11e9-9885-7f4939202bda, 2019-01-13 20:15:56 140320666687424 [Note] WSREP: Waiting for SST to complete., 2019-01-13 20:15:56 140318259275520 [Note] WSREP: STATE EXCHANGE: sent state msg: 08f5da22-1770-11e9-9885-7f4939202bda, 2019-01-13 20:15:56 140318259275520 [Note] WSREP: STATE EXCHANGE: got state msg: 08f5da22-1770-11e9-9885-7f4939202bda from 0 (f4fe8c340a39), 2019-01-13 20:15:56 140318259275520 [Note] WSREP: STATE EXCHANGE: got state msg: 08f5da22-1770-11e9-9885-7f4939202bda from 1 (43116c1e29c2), 2019-01-13 20:15:56 140318259275520 [Note] WSREP: STATE EXCHANGE: got state msg: 08f5da22-1770-11e9-9885-7f4939202bda from 2 (9a89a2dd6d2b), 2019-01-13 20:15:56 140318259275520 [Note] WSREP: Quorum results:, version = 4,, component = PRIMARY,, conf_id = 4,, members = 2/3 (joined/total),, act_id = 3885,, last_appl. = -1,, protocols = 0/9/3 (gcs/repl/appl),, group UUID = a122b452-175e-11e9-8164-e62c82786503, 2019-01-13 20:15:56 140318259275520 [Note] WSREP: Flow-control interval: [28, 28], 2019-01-13 20:15:56 140318259275520 [Note] WSREP: Trying to continue unpaused monitor, 2019-01-13 20:15:56 140318259275520 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 3885), 2019-01-13 20:15:56 140320665188096 [Note] WSREP: State transfer required: , Group state: a122b452-175e-11e9-8164-e62c82786503:3885, Local state: a122b452-175e-11e9-8164-e62c82786503:2914, 2019-01-13 20:15:56 140320665188096 [Note] WSREP: New cluster view: global state: a122b452-175e-11e9-8164-e62c82786503:3885, view# 5: Primary, number of nodes: 3, my index: 0, protocol version 3, 2019-01-13 20:15:56 140320665188096 [Warning] WSREP: Gap in state sequence. Need state transfer., 2019-01-13 20:15:56 140318250882816 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.0.23.29' --datadir '/var/lib/mysql/' --parent '46' '' ''', WSREP_SST: [INFO] Streaming with xbstream (20190113 20:15:56.845), WSREP_SST: [INFO] Using socat as streamer (20190113 20:15:56.848), WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | pv -f -i 10 -N joiner -F '%N => Rate:%r Avg:%a Elapsed:%t %e Bytes: %b %p' 2>>/tmp/mysql-console/fifo | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20190113 20:15:56.903), /usr/local/bin/start.sh: line 413: run-upgrades.sh: command not found, 2019-01-13 20:15:57 140320665188096 [Note] WSREP: Prepared SST request: xtrabackup-v2|10.0.23.29:4444/xtrabackup_sst//1, 2019-01-13 20:15:57 140320665188096 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification., 2019-01-13 20:15:57 140320665188096 [Note] WSREP: REPL Protocols: 9 (4, 2), 2019-01-13 20:15:57 140320665188096 [Note] WSREP: Assign initial position for certification: 3885, protocol version: 4, 2019-01-13 20:15:57 140320515901184 [Note] WSREP: Service thread queue flushed., 2019-01-13 20:15:57 140320665188096 [Note] WSREP: IST receiver addr using tcp://10.0.23.29:4568, 2019-01-13 20:15:57 140320665188096 [Note] WSREP: Prepared IST receiver, listening at: tcp://10.0.23.29:4568, 2019-01-13 20:15:57 140318259275520 [Note] WSREP: Member 0.0 (f4fe8c340a39) requested state transfer from 'any'. Selected 1.0 (43116c1e29c2)(SYNCED) as donor., 2019-01-13 20:15:57 140318259275520 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 3885), 2019-01-13 20:15:57 140320665188096 [Note] WSREP: Requesting state transfer: success, donor: 1, 2019-01-13 20:15:57 140320665188096 [Note] WSREP: GCache history reset: a122b452-175e-11e9-8164-e62c82786503:2914 -> a122b452-175e-11e9-8164-e62c82786503:3885, 2019-01-13 20:15:57 140320665188096 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): full reset, 2019-01-13 20:15:58 140320482330368 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') reconnecting to 91bf1a62 (tcp://10.0.23.3:4567), attempt 0, joiner: => Rate:[ 158 B/s] Avg:[ 158 B/s] Elapsed:0:00:01 Bytes: 214 B , 2019-01-13 20:15:58 140318259275520 [Note] WSREP: 1.0 (43116c1e29c2): State transfer to 0.0 (f4fe8c340a39) complete., 2019-01-13 20:15:58 140318259275520 [Note] WSREP: Member 1.0 (43116c1e29c2) synced with group., WSREP_SST: [INFO] xtrabackup_ist received from donor: Running IST (20190113 20:15:58.264), WSREP_SST: [INFO] Galera co-ords from recovery: a122b452-175e-11e9-8164-e62c82786503:2914 0 (20190113 20:15:58.270), WSREP_SST: [INFO] Total time on joiner: 0 seconds (20190113 20:15:58.275), WSREP_SST: [INFO] Removing the sst_in_progress file (20190113 20:15:58.279), WSREP_SST: [INFO] Cleaning up fifo file /tmp/mysql-console/fifo (20190113 20:15:58.284), rm: cannot remove '/tmp/mysql-console/fifo': Permission denied, 2019-01-13 20:15:58 140318250882816 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.0.23.29' --datadir '/var/lib/mysql/' --parent '46' '' '': 1 (Operation not permitted), 2019-01-13 20:15:58 140320666687424 [Note] WSREP: SST complete, seqno: 2914, 2019-01-13 20:15:58 140320666687424 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins, 2019-01-13 20:15:58 140320666687424 [Note] InnoDB: Uses event mutexes, 2019-01-13 20:15:58 140320666687424 [Note] InnoDB: Compressed tables use zlib 1.2.11, 2019-01-13 20:15:58 140320666687424 [Note] InnoDB: Using Linux native AIO, 2019-01-13 20:15:58 140320666687424 [Note] InnoDB: Number of pools: 1, 2019-01-13 20:15:58 140320666687424 [Note] InnoDB: Using SSE2 crc32 instructions, 2019-01-13 20:15:58 140320666687424 [Note] InnoDB: Initializing buffer pool, total size = 256M, instances = 1, chunk size = 128M, 2019-01-13 20:15:58 140320666687424 [Note] InnoDB: Completed initialization of buffer pool, 2019-01-13 20:15:58 140317243234048 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority()., 2019-01-13 20:15:58 140320666687424 [Note] InnoDB: Highest supported file format is Barracuda., 2019-01-13 20:15:58 140320666687424 [Note] InnoDB: Starting crash recovery from checkpoint LSN=14063401, 2019-01-13 20:15:59 140320666687424 [Note] InnoDB: 128 out of 128 rollback segments are active., 2019-01-13 20:15:59 140320666687424 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1", 2019-01-13 20:15:59 140320666687424 [Note] InnoDB: Creating shared tablespace for temporary tables, 2019-01-13 20:15:59 140320666687424 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ..., 2019-01-13 20:15:59 140320666687424 [Note] InnoDB: File './ibtmp1' size is now 12 MB., 2019-01-13 20:15:59 140320666687424 [Note] InnoDB: Waiting for purge to start, 2019-01-13 20:15:59 140320666687424 [Note] InnoDB: 5.7.23 started; log sequence number 14063410, 2019-01-13 20:15:59 140317084858112 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool, 2019-01-13 20:15:59 140317084858112 [Note] InnoDB: Buffer pool(s) load completed at 190113 20:15:59, 2019-01-13 20:15:59 140320666687424 [Note] Plugin 'FEEDBACK' is disabled., 2019-01-13 20:15:59 140320666687424 [Note] Recovering after a crash using tc.log, 2019-01-13 20:15:59 140320666687424 [Note] Starting crash recovery..., 2019-01-13 20:15:59 140320666687424 [Note] Crash recovery finished., 2019-01-13 20:15:59 140320666687424 [Note] Server socket created on IP: '0.0.0.0'., 2019-01-13 20:15:59 140320666687424 [Warning] 'user' entry 'root@b9f2d8463958' ignored in --skip-name-resolve mode., 2019-01-13 20:15:59 140320666687424 [Warning] 'proxies_priv' entry '@% root@b9f2d8463958' ignored in --skip-name-resolve mode., 2019-01-13 20:16:00 140320666687424 [Note] WSREP: Signalling provider to continue., 2019-01-13 20:16:00 140320666687424 [Note] WSREP: SST received: a122b452-175e-11e9-8164-e62c82786503:2914, 2019-01-13 20:16:00 140320665188096 [Note] WSREP: Receiving IST: 971 writesets, seqnos 2914-3885, 2019-01-13 20:16:00 140318242490112 [Note] WSREP: Receiving IST... 0.0% ( 0/971 events) complete., 2019-01-13 20:16:00 140320666687424 [Note] Reading of all Master_info entries succeded, 2019-01-13 20:16:00 140320666687424 [Note] Added new Master_info '' to hash table, 2019-01-13 20:16:00 140320666687424 [Note] mysqld: ready for connections., Version: '10.2.18-MariaDB-1:10.2.18+maria~bionic' socket: '/var/run/mysqld/mysqld.sock' port: 3306 mariadb.org binary distribution, 2019-01-13 20:16:06 140318242490112 [Note] WSREP: Receiving IST...100.0% (971/971 events) complete., 2019-01-13 20:16:06 140320665188096 [Note] WSREP: IST received: a122b452-175e-11e9-8164-e62c82786503:3885, 2019-01-13 20:16:06 140318259275520 [Note] WSREP: 0.0 (f4fe8c340a39): State transfer from 1.0 (43116c1e29c2) complete., 2019-01-13 20:16:06 140318259275520 [Note] WSREP: Shifting JOINER -> JOINED (TO: 3885), 2019-01-13 20:16:06 140318259275520 [Note] WSREP: Member 0.0 (f4fe8c340a39) synced with group., 2019-01-13 20:16:06 140318259275520 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 3885), 2019-01-13 20:16:06 140320665188096 [Note] WSREP: Synchronized with group, ready for connections, 2019-01-13 20:16:06 140320665188096 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification., 2019-01-13 20:16:42 140320482330368 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') reconnecting to 91bf1a62 (tcp://10.0.23.3:4567), attempt 30, 2019-01-13 20:17:26 140320482330368 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') reconnecting to 91bf1a62 (tcp://10.0.23.3:4567), attempt 60, 2019-01-13 20:18:11 140320482330368 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') reconnecting to 91bf1a62 (tcp://10.0.23.3:4567), attempt 90, 2019-01-13 20:18:55 140320482330368 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') reconnecting to 91bf1a62 (tcp://10.0.23.3:4567), attempt 120, 2019-01-13 20:19:40 140320482330368 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') reconnecting to 91bf1a62 (tcp://10.0.23.3:4567), attempt 150, 2019-01-13 20:20:24 140320482330368 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') reconnecting to 91bf1a62 (tcp://10.0.23.3:4567), attempt 180, 2019-01-13 20:21:09 140320482330368 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') reconnecting to 91bf1a62 (tcp://10.0.23.3:4567), attempt 210, 2019-01-13 20:21:54 140320482330368 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') reconnecting to 91bf1a62 (tcp://10.0.23.3:4567), attempt 240, 2019-01-13 20:22:38 140320482330368 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') reconnecting to 91bf1a62 (tcp://10.0.23.3:4567), attempt 270, 2019-01-13 20:23:23 140320482330368 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') reconnecting to 91bf1a62 (tcp://10.0.23.3:4567), attempt 300, 2019-01-13 20:24:08 140320482330368 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') reconnecting to 91bf1a62 (tcp://10.0.23.3:4567), attempt 330, 2019-01-13 20:24:53 140320482330368 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') reconnecting to 91bf1a62 (tcp://10.0.23.3:4567), attempt 360, 2019-01-13 20:25:38 140320482330368 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') reconnecting to 91bf1a62 (tcp://10.0.23.3:4567), attempt 390, 2019-01-13 20:25:58 140320482330368 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') reconnecting to 6a1450ce (tcp://10.0.23.5:4567), attempt 0, 2019-01-13 20:26:01 140320482330368 [Note] WSREP: evs::proto(08a8d172, GATHER, view_id(REG,08a8d172,6)) suspecting node: 6a1450ce, 2019-01-13 20:26:01 140320482330368 [Note] WSREP: evs::proto(08a8d172, GATHER, view_id(REG,08a8d172,6)) suspecting node: 6a1450ce, 2019-01-13 20:26:01 140320482330368 [Note] WSREP: evs::proto(08a8d172, GATHER, view_id(REG,08a8d172,6)) suspecting node: 91bf1a62, 2019-01-13 20:26:02 140320482330368 [Note] WSREP: evs::proto(08a8d172, GATHER, view_id(REG,08a8d172,6)) suspecting node: 6a1450ce, 2019-01-13 20:26:02 140320482330368 [Note] WSREP: evs::proto(08a8d172, GATHER, view_id(REG,08a8d172,6)) suspecting node: 91bf1a62, 2019-01-13 20:26:02 140320482330368 [Note] WSREP: evs::proto(08a8d172, GATHER, view_id(REG,08a8d172,6)) suspecting node: 6a1450ce, 2019-01-13 20:26:02 140320482330368 [Note] WSREP: evs::proto(08a8d172, GATHER, view_id(REG,08a8d172,6)) suspecting node: 91bf1a62, 2019-01-13 20:26:03 140320482330368 [Note] WSREP: evs::proto(08a8d172, GATHER, view_id(REG,08a8d172,6)) suspecting node: 6a1450ce, 2019-01-13 20:26:03 140320482330368 [Note] WSREP: evs::proto(08a8d172, GATHER, view_id(REG,08a8d172,6)) suspecting node: 91bf1a62, 2019-01-13 20:26:03 140320482330368 [Note] WSREP: evs::proto(08a8d172, GATHER, view_id(REG,08a8d172,6)) suspecting node: 6a1450ce, 2019-01-13 20:26:03 140320482330368 [Note] WSREP: evs::proto(08a8d172, GATHER, view_id(REG,08a8d172,6)) suspecting node: 91bf1a62, 2019-01-13 20:26:04 140320482330368 [Warning] WSREP: evs::proto(08a8d172, GATHER, view_id(REG,08a8d172,6)) install timer expired, evs::proto(evs::proto(08a8d172, GATHER, view_id(REG,08a8d172,6)), GATHER) {, current_view=view(view_id(REG,08a8d172,6) memb {, 08a8d172,0, 6a1450ce,0, 91bf1a62,0, } joined {, } left {, } partitioned {, }),, input_map=evs::input_map: {aru_seq=1740,safe_seq=1740,node_index=node: {idx=0,range=[1741,1740],safe_seq=1740} node: {idx=1,range=[1741,1740],safe_seq=1740} node: {idx=2,range=[1741,1740],safe_seq=1740} },, fifo_seq=3501,, last_sent=1740,, known:, 08a8d172 at {o=1,s=0,i=0,fs=-1,jm=, {v=0,t=4,ut=255,o=1,s=1740,sr=-1,as=1740,f=0,src=08a8d172,srcvid=view_id(REG,08a8d172,6),insvid=view_id(UNKNOWN,00000000,0),ru=00000000,r=[-1,-1],fs=3492,nl=(

}, 08a8d172, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,08a8d172,6),ss=1740,ir=[1741,1740],}, 6a1450ce, {o=0,s=0,e=0,ls=1740,vid=view_id(REG,08a8d172,6),ss=1740,ir=[1741,1740],}, 91bf1a62, {o=1,s=0,e=0,ls=-1,vid=view_id(REG,08a8d172,6),ss=1740,ir=[1741,1740],}, ), },, 6a1450ce at tcp://10.0.23.5:4567, {o=0,s=1,i=0,fs=5143,lm=, {v=0,t=6,ut=255,o=1,s=1740,sr=-1,as=1740,f=4,src=6a1450ce,srcvid=view_id(REG,08a8d172,6),insvid=view_id(UNKNOWN,00000000,0),ru=00000000,r=[-1,-1],fs=5143,nl=(, 91bf1a62 at tcp://10.0.23.3:4567, {o=1,s=1,i=0,fs=3840,jm=, {v=0,t=4,ut=255,o=1,s=1740,sr=-1,as=1740,f=4,src=91bf1a62,srcvid=view_id(REG,08a8d172,6),insvid=view_id(UNKNOWN,00000000,0),ru=00000000,r=[-1,-1],fs=3840,nl=(, install msg={v=0,t=5,ut=255,o=1,s=1740,sr=-1,as=1740,f=6,src=08a8d172,srcvid=view_id(REG,08a8d172,6),insvid=view_id(REG,08a8d172,7),ru=00000000,r=[-1,-1],fs=3493,nl=(

}, 2019-01-13 20:26:04 140320482330368 [Note] WSREP: evs::proto(08a8d172, GATHER, view_id(REG,08a8d172,6)) node 6a1450ce failed to commit for install message, declaring inactive, 2019-01-13 20:26:04 140320482330368 [Note] WSREP: evs::proto(08a8d172, GATHER, view_id(REG,08a8d172,6)) node 91bf1a62 failed to commit for install message, declaring inactive, 2019-01-13 20:26:04 140320482330368 [Note] WSREP: view(view_id(NON_PRIM,08a8d172,6) memb {, 08a8d172,0, } joined {, } left {, } partitioned {, 6a1450ce,0, 91bf1a62,0, }), 2019-01-13 20:26:04 140318259275520 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1, 2019-01-13 20:26:04 140318259275520 [Note] WSREP: Flow-control interval: [16, 16], 2019-01-13 20:26:04 140318259275520 [Note] WSREP: Trying to continue unpaused monitor, 2019-01-13 20:26:04 140318259275520 [Note] WSREP: Received NON-PRIMARY., 2019-01-13 20:26:04 140318259275520 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 3885), 2019-01-13 20:26:04 140320482330368 [Note] WSREP: forgetting 6a1450ce (tcp://10.0.23.5:4567), 2019-01-13 20:26:04 140320482330368 [Note] WSREP: deleting entry tcp://10.0.23.5:4567, 2019-01-13 20:26:04 140320482330368 [Note] WSREP: view(view_id(NON_PRIM,08a8d172,8) memb {, 2019-01-13 20:26:04 140320665188096 [Note] WSREP: New cluster view: global state: a122b452-175e-11e9-8164-e62c82786503:3885, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3, 2019-01-13 20:26:04 140320665188096 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

2019-01-13 20:26:09 140320482330368 [Note] WSREP: cleaning up 6a1450ce (tcp://10.0.23.5:4567), 2019-01-13 20:26:23 140320482330368 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') reconnecting to 91bf1a62 (tcp://10.0.23.3:4567), attempt 420, 2019-01-13 20:27:07 140320482330368 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') reconnecting to 91bf1a62 (tcp://10.0.23.3:4567), attempt 450, 2019-01-13 20:27:52 140320482330368 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') reconnecting to 91bf1a62 (tcp://10.0.23.3:4567), attempt 480, 2019-01-13 20:28:37 140320482330368 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') reconnecting to 91bf1a62 (tcp://10.0.23.3:4567), attempt 510, 2019-01-13 20:29:22 140320482330368 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') reconnecting to 91bf1a62 (tcp://10.0.23.3:4567), attempt 540, 2019-01-13 20:30:06 140320482330368 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') reconnecting to 91bf1a62 (tcp://10.0.23.3:4567), attempt 570, 2019-01-13 20:30:51 140320482330368 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') reconnecting to 91bf1a62 (tcp://10.0.23.3:4567), attempt 600, 2019-01-13 20:31:36 140320482330368 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') reconnecting to 91bf1a62 (tcp://10.0.23.3:4567), attempt 630, 2019-01-13 20:32:21 140320482330368 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') reconnecting to 91bf1a62 (tcp://10.0.23.3:4567), attempt 660, 2019-01-13 20:33:06 140320482330368 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') reconnecting to 91bf1a62 (tcp://10.0.23.3:4567), attempt 690, 2019-01-13 20:33:51 140320482330368 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') reconnecting to 91bf1a62 (tcp://10.0.23.3:4567), attempt 720, 2019-01-13 20:34:36 140320482330368 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') reconnecting to 91bf1a62 (tcp://10.0.23.3:4567), attempt 750, 2019-01-13 20:35:21 140320482330368 [Note] WSREP: (08a8d172, 'tcp://0.0.0.0:4567') reconnecting to 91bf1a62 (tcp://10.0.23.3:4567), attempt 780, Received TERM|INT signal., Shutting down..., 2019-01-13 20:35:55 140317190452992 [Note] mysqld (initiated by: system[system] @ [127.0.0.1]): Normal shutdown, 2019-01-13 20:35:55 140317190452992 [Note] WSREP: Stop replication, 2019-01-13 20:35:55 140317190452992 [Note] WSREP: Closing send monitor..., 2019-01-13 20:35:55 140317190452992 [Note] WSREP: Closed send monitor., 2019-01-13 20:35:55 140317190452992 [Note] WSREP: gcomm: terminating thread, 2019-01-13 20:35:55 140317190452992 [Note] WSREP: gcomm: joining thread, 2019-01-13 20:35:55 140317190452992 [Note] WSREP: gcomm: closing backend, MariaDB exited with return code (0), 2019-01-13 20:35:55 140317190452992 [Note] WSREP: view((empty)), 2019-01-13 20:35:55 140317190452992 [Note] WSREP: gcomm: closed, 2019-01-13 20:35:55 140318259275520 [Note] WSREP: Received self-leave message., 2019-01-13 20:35:55 140318259275520 [Note] WSREP: Flow-control interval: [0, 0], 2019-01-13 20:35:55 140318259275520 [Note] WSREP: Trying to continue unpaused monitor, 2019-01-13 20:35:55 140318259275520 [Note] WSREP: Received SELF-LEAVE. Closing connection., 2019-01-13 20:35:55 140318259275520 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 3885), 2019-01-13 20:35:55 140318259275520 [Note] WSREP: RECV thread exiting 0: Success, 2019-01-13 20:35:55 140320665188096 [Note] WSREP: New cluster view: global state: a122b452-175e-11e9-8164-e62c82786503:3885, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 3, 2019-01-13 20:35:55 140320665188096 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification., 2019-01-13 20:35:55 140317190452992 [Note] WSREP: Closing replication queue., 2019-01-13 20:35:55 140317190452992 [Note] WSREP: recv_thread() joined., 2019-01-13 20:35:55 140317190452992 [Note] WSREP: Closing slave action queue.,

GALERA saved state

version 2.1 uuid: a122b452-175e-11e9-8164-e62c82786503 seqno: 3885 safe_to_bootstrap: 0, 2019-01-13 20:35:55 140320665188096 [Note] WSREP: applier thread exiting (code:0), Goodbye

alphaDev23 commented 5 years ago

Is there any update on this issue and has this been tested on docker swarm? The suggested compose file does not appear to work at all and results in the cluster's failing with the above error.

The root cause of the issue appears to be that the nodes (not the seed) are assigned to different IPs than what the cluster has registered, For example, a node with 10.0.23.3 in the above example does not actually exist and the node IP is a different but on the subnet, e.g. 10.0.23.10. Seed IPs do not appear to be an issue probably because the endpoint_mode is 'vip' (the default) vs 'dsnrr' (per the compose file for nodes). If the endpoint_mode is the default for the nodes, the cluster obviously doesn't work.

colinmollenhour commented 5 years ago

So it seems like there are two pretty simple possible solutions:

  1. Use dnsrr instead of vip.
  2. Use "tasks.node" instead of "node" as the DNS lookup for the service.

I think 2 is the best solution but I don't have a test environment setup.. Can you give this a try?

alphaDev23 commented 5 years ago

I'm using dnsrr for the node service. I do not understand your #2 as the nodes fail with "sleep|no-galera|bash|seed|node ,..." That is it appears that there is no command for task.node. Shouldn't the command for the node service be: "node seed,node" per the suggested compose file?

colinmollenhour commented 5 years ago

The first token is the command, in the case of "node" the second token is the list of DNS names to resolve, so: "node seed,tasks.node" or "node "tasks.seed,tasks.node" to be more consistent.

alphaDev23 commented 5 years ago

Your suggestion #2 resolved the issue. Can you please update the suggested compose file as I spent a significant amount of time seeking a resolution? I appreciate your help in resolving the issue.

colinmollenhour commented 5 years ago

Thanks for reporting and testing!