codership / galera

Synchronous multi-master replication library
GNU General Public License v2.0

Log says 'SYNCED' but wsrep_cluster_state = 1 (JOINING) #143

Open freddielunchbird opened 10 years ago

freddielunchbird commented 10 years ago

Hi,

I have a two-node cluster: Ubuntu 14.04, PXC 5.6.20-68.0, Galera 3.7(r7f44a18), xtrabackup 2.2.4.

I have a problem adding a third node (same software as on the other two nodes). The node gets stuck in 'joining', but the error log reports 'SYNCED'.

Joiner error log:

140924 17:31:51 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql 140924 17:31:51 mysqld_safe Skipping wsrep-recover for empty datadir: /var/lib/mysql 140924 17:31:51 mysqld_safe Assigning 00000000-0000-0000-0000-000000000000:-1 to wsrep_start_position 2014-09-24 17:31:51 0 [Note] WSREP: wsrep_start_position var submitted: '00000000-0000-0000-0000-000000000000:-1' 2014-09-24 17:31:51 15952 [Note] WSREP: Read nil XID from storage engines, skipping position init 2014-09-24 17:31:51 15952 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/libgalera_smm.so' 2014-09-24 17:31:51 15952 [Note] WSREP: wsrep_load(): Galera 3.7(r7f44a18) by Codership Oy info@codership.com loaded successfully. 2014-09-24 17:31:51 15952 [Note] WSREP: CRC-32C: using "slicing-by-8" algorithm. 2014-09-24 17:31:51 15952 [Warning] WSREP: Could not open saved state file for reading: /var/lib/mysql//grastate.dat 2014-09-24 17:31:51 15952 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1 2014-09-24 17:31:51 15952 [Note] WSREP: Passing config to GCS: base_host = 10.177.194.12; base_port = 4567; cert.log_conflicts = no; debug = no; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.npvo = false; 
pc.recovery = 1; pc.version = 0; pc.wait_prim = true; pc.wait_prim_timeout = P30S; pc.w 2014-09-24 17:31:51 15952 [Note] WSREP: Service thread queue flushed. 2014-09-24 17:31:51 15952 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1 2014-09-24 17:31:51 15952 [Note] WSREP: wsrep_sst_grab() 2014-09-24 17:31:51 15952 [Note] WSREP: Start replication 2014-09-24 17:31:51 15952 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1 2014-09-24 17:31:51 15952 [Note] WSREP: protonet asio version 0 2014-09-24 17:31:51 15952 [Note] WSREP: Using CRC-32C for message checksums. 2014-09-24 17:31:51 15952 [Note] WSREP: backend: asio 2014-09-24 17:31:51 15952 [Warning] WSREP: access file(gvwstate.dat) failed(No such file or directory) 2014-09-24 17:31:51 15952 [Note] WSREP: restore pc from disk failed 2014-09-24 17:31:51 15952 [Note] WSREP: GMCast version 0 2014-09-24 17:31:51 15952 [Note] WSREP: (ab9c8371, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567 2014-09-24 17:31:51 15952 [Note] WSREP: (ab9c8371, 'tcp://0.0.0.0:4567') multicast: , ttl: 1 2014-09-24 17:31:51 15952 [Note] WSREP: EVS version 0 2014-09-24 17:31:51 15952 [Note] WSREP: PC version 0 2014-09-24 17:31:51 15952 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer '10.177.192.227:4567,10.179.2.66:4567,10.177.194.12:4567' 2014-09-24 17:31:51 15952 [Warning] WSREP: (ab9c8371, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' points to own listening address, blacklisting 2014-09-24 17:31:51 15952 [Note] WSREP: (ab9c8371, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' pointing to uuid ab9c8371 is blacklisted, skipping 2014-09-24 17:31:51 15952 [Note] WSREP: (ab9c8371, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: 2014-09-24 17:31:51 15952 [Note] WSREP: (ab9c8371, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' pointing to uuid ab9c8371 is blacklisted, skipping 2014-09-24 17:31:51 15952 [Note] 
WSREP: (ab9c8371, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' pointing to uuid ab9c8371 is blacklisted, skipping 2014-09-24 17:31:51 15952 [Note] WSREP: (ab9c8371, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' pointing to uuid ab9c8371 is blacklisted, skipping 2014-09-24 17:31:52 15952 [Note] WSREP: declaring 5fea8f01 at tcp://10.177.192.227:4567 stable 2014-09-24 17:31:52 15952 [Note] WSREP: declaring ce5494a7 at tcp://10.179.2.66:4567 stable 2014-09-24 17:31:52 15952 [Note] WSREP: Node 5fea8f01 state prim 2014-09-24 17:31:52 15952 [Note] WSREP: save pc into disk 2014-09-24 17:31:52 15952 [Note] WSREP: gcomm: connected 2014-09-24 17:31:52 15952 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636 2014-09-24 17:31:52 15952 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0) 2014-09-24 17:31:52 15952 [Note] WSREP: Opened channel 'my_wsrep_cluster' 2014-09-24 17:31:52 15952 [Note] WSREP: Waiting for SST to complete. 2014-09-24 17:31:52 15952 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3 2014-09-24 17:31:52 15952 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID. 2014-09-24 17:31:52 15952 [Note] WSREP: STATE EXCHANGE: sent state msg: b0a379a9-4410-11e4-be8e-76386af2d3c6 2014-09-24 17:31:52 15952 [Note] WSREP: STATE EXCHANGE: got state msg: b0a379a9-4410-11e4-be8e-76386af2d3c6 from 0 (server03) 2014-09-24 17:31:52 15952 [Note] WSREP: STATE EXCHANGE: got state msg: b0a379a9-4410-11e4-be8e-76386af2d3c6 from 2 (server06) 2014-09-24 17:31:52 15952 [Note] WSREP: STATE EXCHANGE: got state msg: b0a379a9-4410-11e4-be8e-76386af2d3c6 from 1 (server01) 2014-09-24 17:31:52 15952 [Note] WSREP: Quorum results: version = 3, component = PRIMARY, conf_id = 159, members = 2/3 (joined/total), act_id = 102139, last_appl. 
= -1, protocols = 0/6/3 (gcs/repl/appl), group UUID = 01ec4764-3fea-11e4-887f-da10e62749e4 2014-09-24 17:31:52 15952 [Note] WSREP: Flow-control interval: [28, 28] 2014-09-24 17:31:52 15952 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 102139) 2014-09-24 17:31:52 15952 [Note] WSREP: State transfer required: Group state: 01ec4764-3fea-11e4-887f-da10e62749e4:102139 Local state: 00000000-0000-0000-0000-000000000000:-1 2014-09-24 17:31:52 15952 [Note] WSREP: New cluster view: global state: 01ec4764-3fea-11e4-887f-da10e62749e4:102139, view# 160: Primary, number of nodes: 3, my index: 1, protocol version 3 2014-09-24 17:31:52 15952 [Warning] WSREP: Gap in state sequence. Need state transfer. 2014-09-24 17:31:52 15952 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.177.194.12' --auth 'root:password' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --parent '15952' '' ' WSREP_SST: [INFO] Streaming with xbstream (20140924 17:31:52.963) WSREP_SST: [INFO] Using socat as streamer (20140924 17:31:52.965) WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20140924 17:31:53.301) 2014-09-24 17:31:53 15952 [Note] WSREP: Prepared SST request: xtrabackup-v2|10.177.194.12:4444/xtrabackup_sst 2014-09-24 17:31:53 15952 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification. 2014-09-24 17:31:53 15952 [Note] WSREP: REPL Protocols: 6 (3, 2) 2014-09-24 17:31:53 15952 [Note] WSREP: Service thread queue flushed. 2014-09-24 17:31:53 15952 [Note] WSREP: Assign initial position for certification: 102139, protocol version: 3 2014-09-24 17:31:53 15952 [Note] WSREP: Service thread queue flushed. 
2014-09-24 17:31:53 15952 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (01ec4764-3fea-11e4-887f-da10e62749e4): 1 (Operation not permitted) at galera/src/replicator_str.cpp:prepare_for_IST():455. IST will be unavailable. 2014-09-24 17:31:53 15952 [Note] WSREP: Member 1.0 (server01) requested state transfer from 'any'. Selected 0.0 (server03)(SYNCED) as donor. 2014-09-24 17:31:53 15952 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 102139) 2014-09-24 17:31:53 15952 [Note] WSREP: Requesting state transfer: success, donor: 0 WSREP_SST: [INFO] Proceeding with SST (20140924 17:31:54.298) WSREP_SST: [INFO] Cleaning the existing datadir (20140924 17:31:54.302) removed ‘/var/lib/mysql/gvwstate.dat’ WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20140924 17:31:54.319) 2014-09-24 17:31:55 15952 [Note] WSREP: (ab9c8371, 'tcp://0.0.0.0:4567') turning message relay requesting off 2014-09-24 17:33:42 15952 [Note] WSREP: 0.0 (server03): State transfer to 1.0 (server01) complete. 2014-09-24 17:33:42 15952 [Note] WSREP: Member 0.0 (server03) synced with group. WSREP_SST: [INFO] Preparing the backup at /var/lib/mysql/ (20140924 17:33:42.257) WSREP_SST: [INFO] Evaluating innobackupex --no-version-check --apply-log $rebuildcmd ${DATA} &>${DATA}/innobackup.prepare.log (20140924 17:33:42.264) WSREP_SST: [INFO] Total time on joiner: 0 seconds (20140924 17:33:50.769) WSREP_SST: [INFO] Removing the sst_in_progress file (20140924 17:33:50.775) 2014-09-24 17:33:50 15952 [Note] WSREP: SST complete, seqno: 102137 2014-09-24 17:33:50 15952 [Note] Plugin 'FEDERATED' is disabled. 
2014-09-24 17:33:50 15952 [Note] InnoDB: Using atomics to ref count buffer pool pages 2014-09-24 17:33:50 15952 [Note] InnoDB: The InnoDB memory heap is disabled 2014-09-24 17:33:50 15952 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins 2014-09-24 17:33:50 15952 [Note] InnoDB: Memory barrier is not used 2014-09-24 17:33:50 15952 [Note] InnoDB: Compressed tables use zlib 1.2.3.4 2014-09-24 17:33:50 15952 [Note] InnoDB: Using Linux native AIO 2014-09-24 17:33:50 15952 [Note] InnoDB: Not using CPU crc32 instructions 2014-09-24 17:33:50 15952 [Note] InnoDB: Initializing buffer pool, size = 908.0M 2014-09-24 17:33:50 15952 [Note] InnoDB: Completed initialization of buffer pool 2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18311 nice to 0 failed, current nice 10, errno 13 2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18320 nice to 0 failed, current nice 10, errno 13 2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18312 nice to 0 failed, current nice 10, errno 13 2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18319 nice to 0 failed, current nice 10, errno 13 2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18318 nice to 0 failed, current nice 10, errno 13 2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18317 nice to 0 failed, current nice 10, errno 13 2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18315 nice to 0 failed, current nice 10, errno 13 2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18314 nice to 0 failed, current nice 10, errno 13 2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18316 nice to 0 failed, current nice 10, errno 13 2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18313 nice to 0 failed, current nice 10, errno 13 2014-09-24 17:33:50 15952 [Note] InnoDB: Highest supported file format is Barracuda. 2014-09-24 17:33:51 15952 [Note] InnoDB: 128 rollback segment(s) are active. 
2014-09-24 17:33:51 15952 [Warning] InnoDB: Setting thread 18325 nice to 0 failed, current nice 10, errno 13 2014-09-24 17:33:51 15952 [Note] InnoDB: Waiting for purge to start 2014-09-24 17:33:51 15952 [Warning] InnoDB: Setting thread 18326 nice to 0 failed, current nice 10, errno 13 2014-09-24 17:33:51 15952 [Warning] InnoDB: Setting thread 18327 nice to 0 failed, current nice 10, errno 13 2014-09-24 17:33:51 15952 [Warning] InnoDB: Setting thread 18328 nice to 0 failed, current nice 10, errno 13 2014-09-24 17:33:51 15952 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.20-68.0 started; log sequence number 767569480 2014-09-24 17:33:51 15952 [Warning] No existing UUID has been found, so we assume that this is the first time that this server has been started. Generating a new UUID: f306c24e-4410-11e4-a863-bc764e08d443. 2014-09-24 17:33:51 15952 [Note] RSA private key file not found: /var/lib/mysql//private_key.pem. Some authentication plugins will not work. 2014-09-24 17:33:51 15952 [Note] RSA public key file not found: /var/lib/mysql//public_key.pem. Some authentication plugins will not work. 2014-09-24 17:33:51 15952 [Note] Server hostname (bind-address): '*'; port: 3306 2014-09-24 17:33:51 15952 [Note] IPv6 is available. 2014-09-24 17:33:51 15952 [Note] - '::' resolves to '::'; 2014-09-24 17:33:51 15952 [Note] Server socket created on IP: '::'. 2014-09-24 17:33:51 15952 [Warning] 'proxies_priv' entry '@ root@server04a' ignored in --skip-name-resolve mode. 2014-09-24 17:33:51 15952 [Note] Event Scheduler: Loaded 0 events 2014-09-24 17:33:51 15952 [Note] WSREP: Signalling provider to continue. 2014-09-24 17:33:51 15952 [Note] WSREP: inited wsrep sidno 1 2014-09-24 17:33:51 15952 [Note] WSREP: SST received: 01ec4764-3fea-11e4-887f-da10e62749e4:102137 2014-09-24 17:33:51 15952 [Note] /usr/sbin/mysqld: ready for connections. 
Version: '5.6.20-68.0-56' socket: '/var/run/mysqld/mysqld.sock' port: 3306 Percona XtraDB Cluster (GPL), Release 25.7, wsrep_25.7.r4126 2014-09-24 17:33:51 15952 [Note] WSREP: 1.0 (server01): State transfer from 0.0 (server03) complete. 2014-09-24 17:33:51 15952 [Note] WSREP: Shifting JOINER -> JOINED (TO: 102139) 2014-09-24 17:33:51 15952 [Note] WSREP: Member 1.0 (server01) synced with group. 2014-09-24 17:33:51 15952 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 102139)

mysql> show global status like '%wsrep%';
+------------------------------+---------------------------------------------------------+
| Variable_name | Value |
+------------------------------+---------------------------------------------------------+
| wsrep_local_state_uuid | 01ec4764-3fea-11e4-887f-da10e62749e4 |
| wsrep_protocol_version | 6 |
| wsrep_last_committed | 102137 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_repl_keys | 0 |
| wsrep_repl_keys_bytes | 0 |
| wsrep_repl_data_bytes | 0 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 3 |
| wsrep_received_bytes | 298 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_max | 2 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_send_queue_avg | 0.333333 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_max | 1 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_recv_queue_avg | 0.000000 |
| wsrep_local_cached_downto | 18446744073709551615 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_paused | 0.000000 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 0.000000 |
| wsrep_apply_oooe | 0.000000 |
| wsrep_apply_oool | 0.000000 |
| wsrep_apply_window | 0.000000 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 0.000000 |
| wsrep_local_state | 1 |
| wsrep_local_state_comment | Joining |
| wsrep_cert_index_size | 0 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0.000000 |
| wsrep_incoming_addresses | 10.177.192.227:3306,10.177.194.12:3306,10.179.2.66:3306 |
| wsrep_evs_repl_latency | 0/0/0/0/0 |
| wsrep_cluster_conf_id | 160 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | 01ec4764-3fea-11e4-887f-da10e62749e4 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 1 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy info@codership.com |
| wsrep_provider_version | 3.7(r7f44a18) |
| wsrep_ready | OFF |
+------------------------------+---------------------------------------------------------+
52 rows in set (0.00 sec)

Running any query on the "joining" node returns "Unknown command", so it is not synced (and wsrep is not ready).

But if there are writes on the cluster (directed at the node that is neither the donor nor the joiner), the join completes:

140924 17:43:51 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql 140924 17:43:51 mysqld_safe Skipping wsrep-recover for empty datadir: /var/lib/mysql 140924 17:43:51 mysqld_safe Assigning 00000000-0000-0000-0000-000000000000:-1 to wsrep_start_position 2014-09-24 17:43:51 0 [Note] WSREP: wsrep_start_position var submitted: '00000000-0000-0000-0000-000000000000:-1' 2014-09-24 17:43:51 27304 [Note] WSREP: Read nil XID from storage engines, skipping position init 2014-09-24 17:43:51 27304 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/libgalera_smm.so' 2014-09-24 17:43:51 27304 [Note] WSREP: wsrep_load(): Galera 3.7(r7f44a18) by Codership Oy info@codership.com loaded successfully. 2014-09-24 17:43:51 27304 [Note] WSREP: CRC-32C: using "slicing-by-8" algorithm. 2014-09-24 17:43:51 27304 [Warning] WSREP: Could not open saved state file for reading: /var/lib/mysql//grastate.dat 2014-09-24 17:43:51 27304 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1 2014-09-24 17:43:51 27304 [Note] WSREP: Passing config to GCS: base_host = 10.177.194.12; base_port = 4567; cert.log_conflicts = no; debug = no; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.npvo = false; 
pc.recovery = 1; pc.version = 0; pc.wait_prim = true; pc.wait_prim_timeout = P30S; pc.w 2014-09-24 17:43:51 27304 [Note] WSREP: Service thread queue flushed. 2014-09-24 17:43:51 27304 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1 2014-09-24 17:43:51 27304 [Note] WSREP: wsrep_sst_grab() 2014-09-24 17:43:51 27304 [Note] WSREP: Start replication 2014-09-24 17:43:51 27304 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1 2014-09-24 17:43:51 27304 [Note] WSREP: protonet asio version 0 2014-09-24 17:43:51 27304 [Note] WSREP: Using CRC-32C for message checksums. 2014-09-24 17:43:51 27304 [Note] WSREP: backend: asio 2014-09-24 17:43:51 27304 [Warning] WSREP: access file(gvwstate.dat) failed(No such file or directory) 2014-09-24 17:43:51 27304 [Note] WSREP: restore pc from disk failed 2014-09-24 17:43:51 27304 [Note] WSREP: GMCast version 0 2014-09-24 17:43:51 27304 [Note] WSREP: (58aa13d7, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567 2014-09-24 17:43:51 27304 [Note] WSREP: (58aa13d7, 'tcp://0.0.0.0:4567') multicast: , ttl: 1 2014-09-24 17:43:51 27304 [Note] WSREP: EVS version 0 2014-09-24 17:43:51 27304 [Note] WSREP: PC version 0 2014-09-24 17:43:51 27304 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer '10.177.192.227:4567,10.179.2.66:4567,10.177.194.12:4567' 2014-09-24 17:43:51 27304 [Warning] WSREP: (58aa13d7, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' points to own listening address, blacklisting 2014-09-24 17:43:51 27304 [Note] WSREP: (58aa13d7, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' pointing to uuid 58aa13d7 is blacklisted, skipping 2014-09-24 17:43:51 27304 [Note] WSREP: (58aa13d7, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: 2014-09-24 17:43:51 27304 [Note] WSREP: (58aa13d7, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' pointing to uuid 58aa13d7 is blacklisted, skipping 2014-09-24 17:43:51 27304 [Note] 
WSREP: (58aa13d7, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' pointing to uuid 58aa13d7 is blacklisted, skipping 2014-09-24 17:43:51 27304 [Note] WSREP: (58aa13d7, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' pointing to uuid 58aa13d7 is blacklisted, skipping 2014-09-24 17:43:51 27304 [Note] WSREP: declaring 5fea8f01 at tcp://10.177.192.227:4567 stable 2014-09-24 17:43:51 27304 [Note] WSREP: declaring ce5494a7 at tcp://10.179.2.66:4567 stable 2014-09-24 17:43:51 27304 [Note] WSREP: Node 5fea8f01 state prim 2014-09-24 17:43:51 27304 [Note] WSREP: save pc into disk 2014-09-24 17:43:52 27304 [Note] WSREP: gcomm: connected 2014-09-24 17:43:52 27304 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636 2014-09-24 17:43:52 27304 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0) 2014-09-24 17:43:52 27304 [Note] WSREP: Opened channel 'my_wsrep_cluster' 2014-09-24 17:43:52 27304 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3 2014-09-24 17:43:52 27304 [Note] WSREP: Waiting for SST to complete. 2014-09-24 17:43:52 27304 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 58f731ba-4412-11e4-970e-0ff58de80f93 2014-09-24 17:43:52 27304 [Note] WSREP: STATE EXCHANGE: sent state msg: 58f731ba-4412-11e4-970e-0ff58de80f93 2014-09-24 17:43:52 27304 [Note] WSREP: STATE EXCHANGE: got state msg: 58f731ba-4412-11e4-970e-0ff58de80f93 from 0 (server01) 2014-09-24 17:43:52 27304 [Note] WSREP: STATE EXCHANGE: got state msg: 58f731ba-4412-11e4-970e-0ff58de80f93 from 1 (server03) 2014-09-24 17:43:52 27304 [Note] WSREP: STATE EXCHANGE: got state msg: 58f731ba-4412-11e4-970e-0ff58de80f93 from 2 (server06) 2014-09-24 17:43:52 27304 [Note] WSREP: Quorum results: version = 3, component = PRIMARY, conf_id = 161, members = 2/3 (joined/total), act_id = 104692, last_appl. 
= -1, protocols = 0/6/3 (gcs/repl/appl), group UUID = 01ec4764-3fea-11e4-887f-da10e62749e4 2014-09-24 17:43:52 27304 [Note] WSREP: Flow-control interval: [28, 28] 2014-09-24 17:43:52 27304 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 104692) 2014-09-24 17:43:52 27304 [Note] WSREP: State transfer required: Group state: 01ec4764-3fea-11e4-887f-da10e62749e4:104692 Local state: 00000000-0000-0000-0000-000000000000:-1 2014-09-24 17:43:52 27304 [Note] WSREP: New cluster view: global state: 01ec4764-3fea-11e4-887f-da10e62749e4:104692, view# 162: Primary, number of nodes: 3, my index: 0, protocol version 3 2014-09-24 17:43:52 27304 [Warning] WSREP: Gap in state sequence. Need state transfer. 2014-09-24 17:43:52 27304 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.177.194.12' --auth 'root:password' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --parent '27304' '' ' WSREP_SST: [INFO] Streaming with xbstream (20140924 17:43:52.285) WSREP_SST: [INFO] Using socat as streamer (20140924 17:43:52.288) WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20140924 17:43:52.648) 2014-09-24 17:43:52 27304 [Note] WSREP: Prepared SST request: xtrabackup-v2|10.177.194.12:4444/xtrabackup_sst 2014-09-24 17:43:52 27304 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification. 2014-09-24 17:43:52 27304 [Note] WSREP: REPL Protocols: 6 (3, 2) 2014-09-24 17:43:52 27304 [Note] WSREP: Service thread queue flushed. 2014-09-24 17:43:52 27304 [Note] WSREP: Assign initial position for certification: 104692, protocol version: 3 2014-09-24 17:43:52 27304 [Note] WSREP: Service thread queue flushed. 
2014-09-24 17:43:52 27304 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (01ec4764-3fea-11e4-887f-da10e62749e4): 1 (Operation not permitted) at galera/src/replicator_str.cpp:prepare_for_IST():455. IST will be unavailable. 2014-09-24 17:43:52 27304 [Note] WSREP: Member 0.0 (server01) requested state transfer from 'any'. Selected 1.0 (server03)(SYNCED) as donor. 2014-09-24 17:43:52 27304 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 104781) 2014-09-24 17:43:52 27304 [Note] WSREP: Requesting state transfer: success, donor: 1 WSREP_SST: [INFO] Proceeding with SST (20140924 17:43:53.812) WSREP_SST: [INFO] Cleaning the existing datadir (20140924 17:43:53.818) removed ‘/var/lib/mysql/gvwstate.dat’ WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20140924 17:43:53.844) 2014-09-24 17:43:55 27304 [Note] WSREP: (58aa13d7, 'tcp://0.0.0.0:4567') turning message relay requesting off 2014-09-24 17:45:43 27304 [Note] WSREP: 1.0 (server03): State transfer to 0.0 (server01) complete. 2014-09-24 17:45:43 27304 [Note] WSREP: Member 1.0 (server03) synced with group. WSREP_SST: [INFO] Preparing the backup at /var/lib/mysql/ (20140924 17:45:43.511) WSREP_SST: [INFO] Evaluating innobackupex --no-version-check --apply-log $rebuildcmd ${DATA} &>${DATA}/innobackup.prepare.log (20140924 17:45:43.520) WSREP_SST: [INFO] Total time on joiner: 0 seconds (20140924 17:45:55.201) WSREP_SST: [INFO] Removing the sst_in_progress file (20140924 17:45:55.209) 2014-09-24 17:45:55 27304 [Note] WSREP: SST complete, seqno: 113877 2014-09-24 17:45:55 27304 [Note] Plugin 'FEDERATED' is disabled. 
2014-09-24 17:45:55 27304 [Note] InnoDB: Using atomics to ref count buffer pool pages 2014-09-24 17:45:55 27304 [Note] InnoDB: The InnoDB memory heap is disabled 2014-09-24 17:45:55 27304 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins 2014-09-24 17:45:55 27304 [Note] InnoDB: Memory barrier is not used 2014-09-24 17:45:55 27304 [Note] InnoDB: Compressed tables use zlib 1.2.3.4 2014-09-24 17:45:55 27304 [Note] InnoDB: Using Linux native AIO 2014-09-24 17:45:55 27304 [Note] InnoDB: Not using CPU crc32 instructions 2014-09-24 17:45:55 27304 [Note] InnoDB: Initializing buffer pool, size = 908.0M 2014-09-24 17:45:55 27304 [Note] InnoDB: Completed initialization of buffer pool 2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29744 nice to 0 failed, current nice 10, errno 13 2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29746 nice to 0 failed, current nice 10, errno 13 2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29752 nice to 0 failed, current nice 10, errno 13 2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29745 nice to 0 failed, current nice 10, errno 13 2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29753 nice to 0 failed, current nice 10, errno 13 2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29751 nice to 0 failed, current nice 10, errno 13 2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29750 nice to 0 failed, current nice 10, errno 13 2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29748 nice to 0 failed, current nice 10, errno 13 2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29747 nice to 0 failed, current nice 10, errno 13 2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29749 nice to 0 failed, current nice 10, errno 13 2014-09-24 17:45:55 27304 [Note] InnoDB: Highest supported file format is Barracuda. 2014-09-24 17:45:55 27304 [Note] InnoDB: 128 rollback segment(s) are active. 
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29759 nice to 0 failed, current nice 10, errno 13 2014-09-24 17:45:55 27304 [Note] InnoDB: Waiting for purge to start 2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29760 nice to 0 failed, current nice 10, errno 13 2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29761 nice to 0 failed, current nice 10, errno 13 2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29762 nice to 0 failed, current nice 10, errno 13 2014-09-24 17:45:55 27304 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.20-68.0 started; log sequence number 784149207 2014-09-24 17:45:55 27304 [Warning] No existing UUID has been found, so we assume that this is the first time that this server has been started. Generating a new UUID: a2d4c7d5-4412-11e4-a92d-bc764e08d443. 2014-09-24 17:45:55 27304 [Note] RSA private key file not found: /var/lib/mysql//private_key.pem. Some authentication plugins will not work. 2014-09-24 17:45:55 27304 [Note] RSA public key file not found: /var/lib/mysql//public_key.pem. Some authentication plugins will not work. 2014-09-24 17:45:55 27304 [Note] Server hostname (bind-address): '*'; port: 3306 2014-09-24 17:45:55 27304 [Note] IPv6 is available. 2014-09-24 17:45:55 27304 [Note] - '::' resolves to '::'; 2014-09-24 17:45:55 27304 [Note] Server socket created on IP: '::'. 2014-09-24 17:45:55 27304 [Warning] 'proxies_priv' entry '@ root@server04a' ignored in --skip-name-resolve mode. 2014-09-24 17:45:56 27304 [Note] Event Scheduler: Loaded 0 events 2014-09-24 17:45:56 27304 [Note] WSREP: Signalling provider to continue. 2014-09-24 17:45:56 27304 [Note] WSREP: inited wsrep sidno 1 2014-09-24 17:45:56 27304 [Note] WSREP: SST received: 01ec4764-3fea-11e4-887f-da10e62749e4:113877 2014-09-24 17:45:56 27304 [Note] /usr/sbin/mysqld: ready for connections. 
Version: '5.6.20-68.0-56' socket: '/var/run/mysqld/mysqld.sock' port: 3306 Percona XtraDB Cluster (GPL), Release 25.7, wsrep_25.7.r4126 2014-09-24 17:45:56 27304 [Note] WSREP: 0.0 (server01): State transfer from 1.0 (server03) complete. 2014-09-24 17:45:56 27304 [Note] WSREP: Shifting JOINER -> JOINED (TO: 115797) 2014-09-24 17:45:57 27304 [Note] WSREP: Member 0.0 (server01) synced with group. 2014-09-24 17:45:57 27304 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 115972) 2014-09-24 17:45:57 27304 [Note] WSREP: Synchronized with group, ready for connections 2014-09-24 17:45:57 27304 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

mysql> show global status like '%wsrep%';
+------------------------------+---------------------------------------------------------+
| Variable_name | Value |
+------------------------------+---------------------------------------------------------+
| wsrep_local_state_uuid | 01ec4764-3fea-11e4-887f-da10e62749e4 |
| wsrep_protocol_version | 6 |
| wsrep_last_committed | 117348 |
| wsrep_replicated | 14 |
| wsrep_replicated_bytes | 3814 |
| wsrep_repl_keys | 14 |
| wsrep_repl_keys_bytes | 434 |
| wsrep_repl_data_bytes | 2484 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 12674 |
| wsrep_received_bytes | 16708102 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_max | 2 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_send_queue_avg | 0.008696 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_max | 11118 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_recv_queue_avg | 4917.729604 |
| wsrep_local_cached_downto | 104693 |
| wsrep_flow_control_paused_ns | 541185 |
| wsrep_flow_control_paused | 0.000004 |
| wsrep_flow_control_sent | 1 |
| wsrep_flow_control_recv | 1 |
| wsrep_cert_deps_distance | 1301.121760 |
| wsrep_apply_oooe | 0.651109 |
| wsrep_apply_oool | 0.005186 |
| wsrep_apply_window | 2.845866 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 2.267070 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 143 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0.394991 |
| wsrep_incoming_addresses | 10.177.194.12:3306,10.177.192.227:3306,10.179.2.66:3306 |
| wsrep_evs_repl_latency | 0.000593955/0.00157661/0.00837575/0.00108117/116 |
| wsrep_cluster_conf_id | 162 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | 01ec4764-3fea-11e4-887f-da10e62749e4 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 0 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy info@codership.com |
| wsrep_provider_version | 3.7(r7f44a18) |
| wsrep_ready | ON |
+------------------------------+---------------------------------------------------------+
52 rows in set (0.00 sec)

freddielunchbird commented 10 years ago

I think it has to do with xtrabackup 2.2.4 somehow; with xtrabackup 2.2.3 I don't experience any problems. Filed a bug report here: https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1373796

dirtysalt commented 10 years ago

My intuition is that this may be a Galera bug.

Although the joiner's log says it is in the SYNCED state, it actually isn't. There are several layers in Galera (GCOMM, GCS, and REPLICATOR). That message means the node is SYNCED at the GCS level, but it is not SYNCED at the REPLICATOR level, because the status shows

| wsrep_local_state_comment | Joining |
| wsrep_ready | OFF |

And normally at the tail of log, there should be

2014-09-24 17:45:57 27304 [Note] WSREP: Synchronized with group, ready for connections
2014-09-24 17:45:57 27304 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

which is the real evidence that the node is SYNCED at the REPLICATOR level. However, those messages cannot be found in the joiner's log.
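The two-level distinction above can be checked mechanically against the error log: a healthy joiner logs both the GCS-level shift and the REPLICATOR-level confirmation. A minimal sketch, assuming only the log lines quoted in this thread (the log path is whatever your installation uses):

```shell
# check_sync LOGFILE: report which sync level a joiner actually reached,
# based on the two log lines discussed above.
check_sync() {
    log="$1"
    if ! grep -q 'Shifting JOINED -> SYNCED' "$log"; then
        echo "not SYNCED at GCS level yet"
    elif grep -q 'Synchronized with group, ready for connections' "$log"; then
        echo "SYNCED at GCS and REPLICATOR level"
    else
        echo "SYNCED at GCS level only (symptom of this bug)"
    fi
}
```

On an affected joiner the third branch fires: the GCS shift is logged, but the REPLICATOR-level confirmation never appears.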

Can you reproduce this bug every time with xtrabackup 2.2.4?

castro1688 commented 10 years ago

I'm having a very similar issue with xtrabackup 2.2.5 and 2.2.4. I'm setting wsrep_sst_method to rsync as a workaround. It seems like this bug isn't getting much attention here or in Launchpad. Rolling back xtrabackup to 2.2.3 has also corrected the issue.

castro1688 commented 10 years ago

I've added this bug to xtrabackup in launchpad: https://bugs.launchpad.net/percona-xtrabackup/+bug/1378138

freddielunchbird commented 10 years ago

@castro1688 : Great! It looks like they are starting to work on the bug reports now so let's see what happens.

@dirtysalt : It is reproducible almost every time. If you stop a node, remove that node's filesystem, and then start the node (JOINER), this is very likely to happen.

dbadapt commented 10 years ago

More information here: https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1373796/comments/10

The problem may be in gcs/src/gcs.cpp: gcs_recv_thread(void *arg)

I believe the wsrep 'synced' callback is not being sent back to the application.

joffrey92 commented 10 years ago

Hi,

I had the issue with xtrabackup 2.2.6 debian today. http://pastebin.com/G3hGbcLx

Thanks, Joffrey

ronin13 commented 10 years ago

@joffrey92 Thanks for the backtrace. If you still have the coredump, can you provide 'thread apply all bt full' from it? (and also install debuginfo packages for mysqld if possible).

ronin13 commented 9 years ago

Is https://github.com/codership/galera/issues/133 related to this issue? I.e., are the same race conditions responsible?

mike-sirs commented 9 years ago

Hello, I have xtrabackup version 2.2.8 based on MySQL server 5.6.22 Linux (x86_64) (revision id: )

The problem is still present:

2015-02-10 20:25:57 8459 [Note] WSREP: 2.0 (pnode1): State transfer from 0.0 (pnode0) complete.
2015-02-10 20:25:57 8459 [Note] WSREP: Shifting JOINER -> JOINED (TO: 124879)
2015-02-10 20:25:57 8459 [Note] WSREP: Member 2.0 (pnode1) synced with group.
2015-02-10 20:25:57 8459 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 124879)
2015-02-10 20:25:57 8459 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.6.21-70.1-56'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  Percona XtraDB Cluster (GPL), Release rel70.1, Revision 938, WSREP version 25.8, wsrep_25.8.r4150

show status like 'wsrep%';
wsrep_local_state_comment | Joining

castro1688 commented 9 years ago

Well, this still exists. I'm reproducing it with xtrabackup 2.2.10.

dbadapt commented 9 years ago

Hi Castro, this bug is my 'big one that gets away'. I am thinking of naming it, something like "Ol' Oscar" or Goliath... Seriously, I am still unsure whether it's an issue with the Galera engine or with the WSREP interface on the mysqld side. I have seen it and been able to reproduce it, but it quickly disappears when one of the components is slightly changed.

You can always tell this bug by the missing log entry from the WSREP side after SST:

... [Note] WSREP: Synchronized with group, ready for connections

This final message is sent by Galera via the callback to sql/wsrep_mysqld.cc:wsrep_synced_cb, and sometimes it doesn't happen for some unknown reason. Perhaps the thread is being clobbered, or there is a conditional hiccup that we are not seeing (I've looked).

My belief is that it is triggered by a very subtle I/O-bound race condition between components (either fast or slow), and that changing one component shifts the dynamic ever so slightly that the bug disappears. I have seen it on VMs and have unsuccessfully tried to reproduce it on bare metal. I have also tried changing VM processor counts and mysqld parameters, without success.

castro1688 commented 9 years ago

So I've really been digging today. In addition to the missing "Synchronized with group" message, I've also noticed that the full "heading" is missing when you log into mysql.

I'm starting to think the problem could be a result of the donor being a node that was started with bootstrap-pxc and was never restarted. I pointed wsrep_sst_donor to my other two "working" nodes. One started with bootstrap-pxc, and the other with a normal start. I could reproduce the problem only with bootstrap-pxc node. Thoughts?

Broken - Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 34 Server version: 5.6.22-72.0-56-log

Working - Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 34 Server version: 5.6.22-72.0-56-log Percona XtraDB Cluster (GPL), Release rel72.0, Revision 978, WSREP version 25.8, wsrep_25.8.r4150

castro1688 commented 9 years ago

David, one last question: do you have a workaround? I downgraded xtrabackup to 2.2.3 to get past this with PXC 5.6.20. Now I'm trying to upgrade to 5.6.22 and it requires xtrabackup > 2.2.5.

soluwalana commented 9 years ago

I'm having a similar issue when attempting to deploy the cluster to SSD, is there any known workaround for this issue?

samos123 commented 9 years ago

I'm using 2.1.8-1 on Ubuntu 14.04; any workaround?

luksch commented 9 years ago

Now this bug hit me too: a 3-node cluster that was running fine before. After a server crash I can't get the node to properly join; wsrep_local_state_comment stays on "Joined" forever.

ayurchen commented 9 years ago
  1. is the cluster idle or running a load?
  2. is comment "Joined" or "Joining"?
luksch commented 9 years ago

It is a production node under quite some load (about 50MB write load per minute). Most writes are supposed to go to this node, but now the two remaining and working nodes cover it.

tomashejatko commented 8 years ago

Hi, in my setup this is always reproducible: I'm not able to get the third member back into the cluster. I'm running Galera inside a Docker container on a VM, which is on SSD.

My versions are :

innobackupex version 2.3.3 Linux (x86_64) (revision id: 525ca7d)
Server version: 5.6.26-74.0-56 Percona XtraDB Cluster (GPL), Release rel74.0, Revision 624ef81, WSREP version 25.12, wsrep_25.12

ayurchen commented 8 years ago

Whoever is experiencing this problem with PXC or MariaDB, could you try with Codership's binaries: http://galeracluster.com/downloads/

SylvainA77 commented 8 years ago

I am also experiencing the very same problem at a client's, using rsync as the SST method, so it does not seem to be xtrabackup-related. I will try to provide you with some logs ASAP.

In my case the cluster is stalled and not responding to write operations, despite node1 stating it is synced. It still serves reads.

Versions are: MariaDB 5.5.41, Galera provider version 3.5 (rXXXX), from RHEL packages on OpenStack.

In the logs the last state line says JOINING -> JOINED, but wsrep_local_state_comment is Joining. What we also saw with netstat -an is this line:

tcp 90002 0 local-ip:4567 ist_node_ip ESTABLISHED

as if the MariaDB node was unable to read anything from the socket.
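That ever-growing Recv-Q can be flagged automatically. A small sketch that filters `netstat -ant` output for group-communication sockets the local node has stopped draining; the 65536-byte threshold is an arbitrary assumption, not an official limit:

```shell
# flag_stuck: read `netstat -ant` output on stdin and print Galera
# group-communication connections (port 4567) whose Recv-Q exceeds a
# threshold, i.e. sockets the local node has stopped reading from.
# Columns in `netstat -ant`: Proto Recv-Q Send-Q Local Foreign State.
flag_stuck() {
    awk '$1 == "tcp" && $4 ~ /:4567$/ && $2 > 65536 {
        print "stuck:", $4, "<-", $5, "Recv-Q =", $2
    }'
}
# Typical use: netstat -ant | flag_stuck
```

A healthy node shows Recv-Q hovering near zero; a node exhibiting the symptom above prints a steadily growing value for the IST/group socket.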

ethaniel commented 8 years ago

I am done with MariaDB and its cluster problems. I successfully moved to Percona Cluster and haven't regretted it even once. Setup was super easy, and cluster commits work WAY faster. Same Galera engine.

rasschaert commented 8 years ago

@ethaniel, the reporter of this very issue was also running Percona XtraDB Cluster, so I'm not sure how you think you're impervious to this bug. You even said yourself it's the same Galera engine, which is where this bug (still) is.

But hey if switching from MariaDB to PXC helps you sleep at night, good for you.

Please post your snark on Twitter instead of in Github issues next time.

fraenki commented 3 years ago

Ugh, this is a really old bug report. I'm seeing the same issue the original reporter described back in 2014, but on mysqld-wsrep 5.7-5.7.33-25.25. Is anyone else still experiencing this, and has anyone found a workaround?

The debug log output is interesting:

2021-05-21T19:00:07.370469+01:00 0 [Note] WSREP: SST complete, seqno: 40917804340
...
2021-05-21T19:00:48.163817+01:00 0 [Note] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/galera/src/replicator_str.cpp:sst_received():38: SST received: ac673ca1-97f8-11e9-a0b3-5b765e435c14:40917804340
2021-05-21T19:00:48.163936+01:00 1 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/galera/src/saved_state.cpp:set():187: Not writing state: unsafe counter is 1
2021-05-21T19:00:48.165466+01:00 0 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.7.33-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server - (GPL), wsrep_25.25
2021-05-21T19:00:48.165469+01:00 1 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/galera/src/replicator_str.cpp:request_state_transfer():719: Installed new state: ac673ca1-97f8-11e9-a0b3-5b765e435c14:40917804340
2021-05-21T19:00:48.165516+01:00 3 [Note] WSREP: SYSTEM_THREAD_COMPRESS_GTID_TABLE declared as non wsrep_on
2021-05-21T19:00:48.165541+01:00 1 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/galera/src/saved_state.cpp:set():187: Not writing state: unsafe counter is 1
2021-05-21T19:00:48.167212+01:00 0 [Note] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():801: 0.0 (galera6): State transfer from 1.0 (galera5) complete.
2021-05-21T19:00:48.167247+01:00 0 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs.cpp:gcs_handle_state_change():1051: Got 'JOIN' dated 40917804340
2021-05-21T19:00:48.167260+01:00 0 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs.cpp:gcs_fc_cont_end():524: SKIPPED FC_CONT sending: stop_sent = 0
2021-05-21T19:00:48.167271+01:00 0 [Note] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs.cpp:gcs_shift_state():635: Shifting JOINER -> JOINED (TO: 40920891251)
2021-05-21T19:00:48.167281+01:00 0 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs.cpp:gcs_become_joined():806: Become joined, FC offset 13917607
2021-05-21T19:00:48.167291+01:00 0 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs.cpp:gcs_send_sync_end():568: SENDING SYNC
2021-05-21T19:00:48.168191+01:00 0 [Note] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs_group.cpp:gcs_group_handle_sync_msg():847: Member 0.0 (galera6) synced with group.
2021-05-21T19:00:48.168225+01:00 0 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs.cpp:gcs_handle_state_change():1051: Got 'SYNC' dated 0
2021-05-21T19:00:48.168239+01:00 0 [Note] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs.cpp:gcs_shift_state():635: Shifting JOINED -> SYNCED (TO: 40920891253)
2021-05-21T19:00:48.168252+01:00 0 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs.cpp:gcs_become_synced():826: Become synced, FC offset 13917556
2021-05-21T19:01:09.711228+01:00 1 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/galera/src/certification.cpp:append_trx():1068: seqno gap, position: 40907017112 trx seqno 40917804341
2021-05-21T19:19:01.143739+01:00 0 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcomm/src/evs_proto.cpp:send_user():1514: adjusted seq range to: 1

It looks like immediately after SST is complete, the WSREP provider shifts state from JOINER to JOINED and then to SYNCED. But apparently the node is NOT synced: wsrep_local_state_comment still says "Joining". And wsrep_local_recv_queue is always increasing and never decreasing.

The "ready for connections" message is also not true, because mysqld will still deny any query: ERROR 1047 (08S01): WSREP has not yet prepared node for application use

So the node is stuck in this state forever.

@temeo Could you possibly have a look? Thank you :)

DanielWeeber commented 3 years ago

Same problem here with 10.6.5 (galera 4 26.4.9) Except that wsrep_local_recv_queue stays empty.

fraenki commented 3 years ago

So the node is stuck in this state forever.

To follow up on my own report: I was able to fix this issue. In my case it was triggered by some super-aggressive MySQL tuning I had applied to that node. I couldn't nail it down to a specific setting (that would be too time-consuming), but reverting to a MySQL config closer to the defaults solved the issue for me.

I'm not saying that there is no bug, but if there is one, it's probably only triggered by non-default configurations. So it's probably worth trying again with a different MySQL/MariaDB config.

tchernomax commented 1 year ago

In our test environment, the problem was triggered by wsrep_provider_options="gcache.size=2G" (on hosts with 4GiB of RAM). Removing the setting (using the default of 128MiB) fixed the problem.
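For anyone searching later, the option in question is passed through wsrep_provider_options. A my.cnf fragment illustrating the setting reported above (purely illustrative; the values come from that comment, not from any recommendation):

```ini
# my.cnf fragment (illustrative). Per the comment above, this oversized
# gcache on a host with 4 GiB of RAM triggered the stuck-JOINING symptom;
# deleting the line reverts gcache.size to its 128M default.
[mysqld]
wsrep_provider_options="gcache.size=2G"
```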