ClusterLabs / resource-agents

Combined repository of OCF agents from the RHCS and Linux-HA projects
GNU General Public License v2.0

Strange issue, node joins the cluster and then Abort #1837

Open IRCGeek opened 1 year ago

IRCGeek commented 1 year ago

Please advise. I cannot find the reason for the abort in the logs below.

Distributor ID: Ubuntu
Description: Ubuntu 20.04.5 LTS
Release: 20.04
Codename: focal

Linux instance-20220523-1749 5.15.0-1016-oracle #20~20.04.1-Ubuntu SMP Mon Aug 8 07:30:37 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux

mysql Ver 15.1 Distrib 10.3.37-MariaDB, for debian-linux-gnu (aarch64) using readline 5.2

2023-01-22 6:26:52 0 [Note] WSREP: Read nil XID from storage engines, skipping position init
2023-01-22 6:26:52 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/galera/libgalera_smm.so'
2023-01-22 6:26:52 0 [Note] WSREP: wsrep_load(): Galera 3.29(ra60e019) by Codership Oy info@codership.com loaded successfully.
2023-01-22 6:26:52 0 [Note] WSREP: CRC-32C: using "slicing-by-8" algorithm.
2023-01-22 6:26:52 0 [Note] WSREP: Found saved state: f1e1e4a1-9a0f-11ed-9c0f-1be97a0e7b5b:-1, safe_to_bootstrap: 1
2023-01-22 6:26:52 0 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 10.0.1.169; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; p
2023-01-22 6:26:52 0 [Note] WSREP: Assign initial position for certification: 0, protocol version: -1
2023-01-22 6:26:52 0 [Note] WSREP: wsrep_sst_grab()
2023-01-22 6:26:52 0 [Note] WSREP: Start replication
2023-01-22 6:26:52 0 [Note] WSREP: Setting initial position to f1e1e4a1-9a0f-11ed-9c0f-1be97a0e7b5b:0
2023-01-22 6:26:52 0 [Note] WSREP: protonet asio version 0
2023-01-22 6:26:52 0 [Note] WSREP: Using CRC-32C for message checksums.
2023-01-22 6:26:52 0 [Note] WSREP: backend: asio
2023-01-22 6:26:52 0 [Note] WSREP: gcomm thread scheduling priority set to other:0
2023-01-22 6:26:52 0 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
2023-01-22 6:26:52 0 [Note] WSREP: restore pc from disk failed
2023-01-22 6:26:52 0 [Note] WSREP: GMCast version 0
2023-01-22 6:26:52 0 [Note] WSREP: (c273183b, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2023-01-22 6:26:52 0 [Note] WSREP: (c273183b, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2023-01-22 6:26:52 0 [Note] WSREP: EVS version 0
2023-01-22 6:26:52 0 [Note] WSREP: gcomm: connecting to group 'MariaDB Galera Cluster', peer '35.212.132.67:,85.122.127.235:,192.3.91.168:,10.0.1.169:'
2023-01-22 6:26:52 0 [Note] WSREP: (c273183b, 'tcp://0.0.0.0:4567') Found matching local endpoint for a connection, blacklisting address tcp://10.0.1.169:4567
2023-01-22 6:26:52 0 [Note] WSREP: (c273183b, 'tcp://0.0.0.0:4567') connection established to 6ab548fb tcp://85.122.127.235:4567
2023-01-22 6:26:52 0 [Note] WSREP: (c273183b, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2023-01-22 6:26:52 0 [Note] WSREP: (c273183b, 'tcp://0.0.0.0:4567') connection established to 7191d038 tcp://192.3.91.168:4567
2023-01-22 6:26:53 0 [Note] WSREP: (c273183b, 'tcp://0.0.0.0:4567') connection established to 5149ecf7 tcp://35.212.132.67:4567
2023-01-22 6:26:53 0 [Note] WSREP: (c273183b, 'tcp://0.0.0.0:4567') connection established to 7191d038 tcp://192.3.91.168:4567
2023-01-22 6:26:53 0 [Note] WSREP: (c273183b, 'tcp://0.0.0.0:4567') connection established to 5149ecf7 tcp://35.212.132.67:4567
2023-01-22 6:26:54 0 [Note] WSREP: declaring 5149ecf7 at tcp://35.212.132.67:4567 stable
2023-01-22 6:26:54 0 [Note] WSREP: declaring 6ab548fb at tcp://85.122.127.235:4567 stable
2023-01-22 6:26:54 0 [Note] WSREP: declaring 7191d038 at tcp://192.3.91.168:4567 stable
2023-01-22 6:26:55 0 [Note] WSREP: view(view_id(NON_PRIM,5149ecf7,2792) memb { 5149ecf7,0 6ab548fb,0 7191d038,0 c273183b,0 } joined { } left { } partitioned { 07d1e878,0 77a052d6,0 9e708b0d,0 a1d8493f,0 a79d17a2,0 b2984c45,0 b60181c9,0 bad3a406,0 c401108d,0 d392217d,0 dc8cc78b,0 })
2023-01-22 6:26:56 0 [Note] WSREP: (c273183b, 'tcp://0.0.0.0:4567') turning message relay requesting off
2023-01-22 6:27:23 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out) at gcomm/src/pc.cpp:connect():160
2023-01-22 6:27:23 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2023-01-22 6:27:23 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1457: Failed to open channel 'MariaDB Galera Cluster' at 'gcomm://35.212.132.67,85.122.127.235,192.3.91.168,10.0.1.169': -110 (Connection timed out)
2023-01-22 6:27:23 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2023-01-22 6:27:23 0 [ERROR] WSREP: wsrep::connect(gcomm://35.x.x.x,85.122.x.x,192.3.x.x,10.0.x.x) failed: 7
2023-01-22 6:27:23 0 [ERROR] Aborting

dciabrin commented 1 year ago

From the few logs here, it looks like the joiner could reach at least one of the four nodes listed in the gcomm address, from which it could join the running cluster. But once it initiated the connection, it determined that it had joined a partition of the cluster that has lost quorum (only 4 nodes out of 15?). It probably refused to carry on from that point onward.
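
To check the "4 out of 15" reading against the log, here is a minimal Python sketch that parses the `view(...)` entry and counts the nodes in each list. The helper names `parse_view` and `has_majority` are hypothetical, and the simple-majority test with equal node weights is an assumption made for illustration: Galera's real quorum calculation is weighted and based on the last primary component, so treat this as a rough sanity check only.

```python
import re

# The view entry from the log above, copied verbatim.
LOG = (
    "view(view_id(NON_PRIM,5149ecf7,2792) "
    "memb { 5149ecf7,0 6ab548fb,0 7191d038,0 c273183b,0 } "
    "joined { } left { } "
    "partitioned { 07d1e878,0 77a052d6,0 9e708b0d,0 a1d8493f,0 "
    "a79d17a2,0 b2984c45,0 b60181c9,0 bad3a406,0 c401108d,0 "
    "d392217d,0 dc8cc78b,0 })"
)


def parse_view(line):
    """Extract the view type and the node lists from a 'view(...)' log entry."""
    kind = re.search(r"view_id\((\w+),", line)
    view = {"kind": kind.group(1) if kind else None}
    for name in ("memb", "joined", "left", "partitioned"):
        match = re.search(name + r"\s*\{([^}]*)\}", line)
        body = match.group(1) if match else ""
        # Each entry looks like '<8 hex chars>,<number>'; keep only the node id.
        view[name] = re.findall(r"([0-9a-f]{8}),\d+", body)
    return view


def has_majority(view):
    """Assumed rule: reachable members must outnumber half of all known nodes."""
    reachable = len(view["memb"])
    known = reachable + len(view["partitioned"])
    return reachable * 2 > known


view = parse_view(LOG)
known_nodes = len(view["memb"]) + len(view["partitioned"])
print(view["kind"], len(view["memb"]), known_nodes, has_majority(view))
```

Under these assumptions the joiner sees 4 members and 11 partitioned nodes, which is short of a majority and consistent with the `NON_PRIM` view type and the later "failed to reach primary view" error.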