bitnami / charts

Unable to cleanly restart or recover a mariadb-galera cluster #8721

Closed jfillatre closed 2 years ago

jfillatre commented 2 years ago

Which chart:

Chart mariadb-galera-6.0.6 (appVersion 10.6.5). MariaDB container image: docker.io/bitnami/mariadb-galera:10.4.22-debian-10-r20

Describe the bug

Previously, with the 4.3.3 chart and the 10.1.46-debian-10-r17 image tag, I was able to cleanly restart a cluster by scaling the StatefulSet down and back up. I was also able to recover from an unclean cluster failure by applying the documented procedure and then re-applying the initial chart configuration after a graceful shutdown.

This no longer works with the 10.4.22-debian-10-r20 container image; the first node is never able to start:

k logs mariadb-galera-0 mariadb-galera -f
mariadb 16:13:47.63 
mariadb 16:13:47.63 Welcome to the Bitnami mariadb-galera container
mariadb 16:13:47.63 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-mariadb-galera
mariadb 16:13:47.63 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-mariadb-galera/issues
mariadb 16:13:47.63 
mariadb 16:13:47.64 INFO  ==> ** Starting MariaDB setup **
mariadb 16:13:47.66 INFO  ==> Validating settings in MYSQL_*/MARIADB_* env vars
mariadb 16:13:47.67 INFO  ==> Initializing mariadb database
mariadb 16:13:47.68 WARN  ==> The mariadb configuration file '/opt/bitnami/mariadb/conf/my.cnf' is not writable or does not exist. Configurations based on environment variables will not be applied for this file.
mariadb 16:13:47.68 INFO  ==> Persisted data detected. Restoring
mariadb 16:13:47.69 INFO  ==> ** MariaDB setup finished! **

mariadb 16:13:47.74 INFO  ==> ** Starting MariaDB **
mariadb 16:13:47.74 INFO  ==> Setting previous boot
2022-01-18 16:13:47 0 [Note] /opt/bitnami/mariadb/sbin/mysqld (mysqld 10.4.22-MariaDB-log) starting as process 1 ...
2022-01-18 16:13:47 0 [Note] WSREP: Loading provider /opt/bitnami/mariadb/lib/libgalera_smm.so initial position: 00000000-0000-0000-0000-000000000000:-1
2022-01-18 16:13:47 0 [Note] WSREP: wsrep_load(): loading provider library '/opt/bitnami/mariadb/lib/libgalera_smm.so'
2022-01-18 16:13:47 0 [Note] WSREP: wsrep_load(): Galera 4.9(rXXXX) by Codership Oy <info@codership.com> loaded successfully.
2022-01-18 16:13:47 0 [Note] WSREP: CRC-32C: using 64-bit x86 acceleration.
2022-01-18 16:13:47 0 [Note] WSREP: Found saved state: 7cf4e391-7878-11ec-abf9-cbf61c56b744:-1, safe_to_bootstrap: 1
2022-01-18 16:13:47 0 [Note] WSREP: GCache DEBUG: opened preamble:
Version: 2
UUID: 7cf4e391-7878-11ec-abf9-cbf61c56b744
Seqno: 1 - 17
Offset: 1280
Synced: 1
2022-01-18 16:13:47 0 [Note] WSREP: Recovering GCache ring buffer: version: 2, UUID: 7cf4e391-7878-11ec-abf9-cbf61c56b744, offset: 1280
2022-01-18 16:13:47 0 [Note] WSREP: GCache::RingBuffer initial scan...  0.0% (        0/134217752 bytes) complete.
2022-01-18 16:13:47 0 [Note] WSREP: GCache::RingBuffer initial scan...100.0% (134217752/134217752 bytes) complete.
2022-01-18 16:13:47 0 [Note] WSREP: Recovering GCache ring buffer: found gapless sequence 1-17
2022-01-18 16:13:47 0 [Note] WSREP: GCache::RingBuffer unused buffers scan...  0.0% (   0/7392 bytes) complete.
2022-01-18 16:13:47 0 [Note] WSREP: GCache::RingBuffer unused buffers scan...100.0% (7392/7392 bytes) complete.
2022-01-18 16:13:47 0 [Note] WSREP: GCache DEBUG: RingBuffer::recover(): found 4/21 locked buffers
2022-01-18 16:13:47 0 [Note] WSREP: GCache DEBUG: RingBuffer::recover(): free space: 134210824/134217728
2022-01-18 16:13:47 0 [Note] WSREP: Passing config to GCS: base_dir = /bitnami/mariadb/data/; base_host = 10.240.0.124; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /bitnami/mariadb/data/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = galera.cache; gcache.page_size = 128M; gcache.recover = yes; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; 
2022-01-18 16:13:47 0 [Note] WSREP: Start replication
2022-01-18 16:13:47 0 [Note] WSREP: Connecting with bootstrap option: 0
2022-01-18 16:13:47 0 [Note] WSREP: Setting GCS initial position to 00000000-0000-0000-0000-000000000000:-1
2022-01-18 16:13:47 0 [Note] WSREP: protonet asio version 0
2022-01-18 16:13:47 0 [Note] WSREP: Using CRC-32C for message checksums.
2022-01-18 16:13:47 0 [Note] WSREP: backend: asio
2022-01-18 16:13:47 0 [Note] WSREP: gcomm thread scheduling priority set to other:0 
2022-01-18 16:13:47 0 [Warning] WSREP: access file(/bitnami/mariadb/data//gvwstate.dat) failed(No such file or directory)
2022-01-18 16:13:47 0 [Note] WSREP: restore pc from disk failed
2022-01-18 16:13:47 0 [Note] WSREP: GMCast version 0
2022-01-18 16:13:47 0 [Warning] WSREP: Failed to resolve tcp://mariadb-galera-headless.tpe-dev.svc.cluster.local:4567
2022-01-18 16:13:47 0 [Note] WSREP: (9df5b66f-9bce, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2022-01-18 16:13:47 0 [Note] WSREP: (9df5b66f-9bce, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2022-01-18 16:13:47 0 [Note] WSREP: EVS version 1
2022-01-18 16:13:47 0 [Note] WSREP: gcomm: connecting to group 'galera', peer 'mariadb-galera-headless.tpe-dev.svc.cluster.local:'
2022-01-18 16:13:47 0 [ERROR] WSREP: failed to open gcomm backend connection: 131: No address to connect (FATAL)
     at /bitnami/blacksmith-sandox/libgalera-26.4.9/gcomm/src/gmcast.cpp:connect_precheck():317
2022-01-18 16:13:47 0 [ERROR] WSREP: /bitnami/blacksmith-sandox/libgalera-26.4.9/gcs/src/gcs_core.cpp:gcs_core_open():220: Failed to open backend connection: -131 (State not recoverable)
2022-01-18 16:13:47 0 [ERROR] WSREP: /bitnami/blacksmith-sandox/libgalera-26.4.9/gcs/src/gcs.cpp:gcs_open():1633: Failed to open channel 'galera' at 'gcomm://mariadb-galera-headless.tpe-dev.svc.cluster.local': -131 (State not recoverable)
2022-01-18 16:13:47 0 [ERROR] WSREP: gcs connect failed: State not recoverable
2022-01-18 16:13:47 0 [ERROR] WSREP: wsrep::connect(gcomm://mariadb-galera-headless.tpe-dev.svc.cluster.local) failed: 7
2022-01-18 16:13:47 0 [ERROR] Aborting
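
The decisive lines in the log above are easy to miss: the node finds a saved state with safe_to_bootstrap: 1, yet it connects with bootstrap option 0 and then aborts because the headless service name does not resolve to any peer address. As a minimal sketch (the function name galera_boot_summary is hypothetical, and the patterns are simply copied from the messages in this log), a small shell filter can pull those lines out of a long startup log:

```shell
# Print only the parts of a MariaDB Galera startup log that explain the
# bootstrap decision. Reads the log on stdin.
# NOTE: the patterns are taken from the log in this issue; other Galera
# versions may word these messages differently.
galera_boot_summary() {
  grep -E -o \
    -e 'Found saved state: [^ ]+, safe_to_bootstrap: [01]' \
    -e 'Connecting with bootstrap option: [01]' \
    -e 'Failed to resolve tcp://[^ ]+' \
    -e 'failed to open gcomm backend connection: .*'
}
```

It could be used as `kubectl logs mariadb-galera-0 -c mariadb-galera | galera_boot_summary`, assuming `k` in this issue is an alias for `kubectl`.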

I can also reproduce this with the latest 6.2.0 chart and 10.6.5-debian-10-r35. However, the 10.6.4-debian-10-r30 image is not affected. I worked around the issue by replacing the scripts in 10.4.22-debian-10-r20 with those from 10.6.4-debian-10-r30:

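# Stage 1: the unaffected image, used only as a source of setup scripts
FROM docker.io/bitnami/mariadb-galera:10.6.4-debian-10-r30 AS scripts
# Stage 2: the affected image, with its scripts overwritten by the older ones
FROM docker.io/bitnami/mariadb-galera:10.4.22-debian-10-r20

COPY --from=scripts /opt/bitnami/scripts /opt/bitnami/scripts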

To Reproduce

Steps to reproduce the behavior:

  1. Deploy the chart, overriding the image tag used by the StatefulSet with 10.4.22-debian-10-r20
  2. k scale statefulset mariadb-galera --replicas=0
  3. k scale statefulset mariadb-galera --replicas=3
  4. See the error with k logs mariadb-galera-0 mariadb-galera -f
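
The scale-down/scale-up cycle described in this issue can be sketched as a small script. This is an illustration under stated assumptions, not the reporter's actual procedure: the function name cold_restart, the KUBECTL override, and the 120s timeout are all hypothetical choices, and pods are assumed to be named <statefulset>-<ordinal>.

```shell
# Sketch of a cold restart: scale the StatefulSet to zero, wait until every
# pod is gone, then scale back up.
# KUBECTL can be overridden, e.g. KUBECTL=echo for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

cold_restart() {
  # $1 = StatefulSet name, $2 = replica count to restore
  $KUBECTL scale statefulset "$1" --replicas=0
  # Wait for each pod to be deleted, highest ordinal first.
  # The 120s timeout is an arbitrary assumption.
  i=$(($2 - 1))
  while [ "$i" -ge 0 ]; do
    $KUBECTL wait --for=delete "pod/$1-$i" --timeout=120s
    i=$((i - 1))
  done
  $KUBECTL scale statefulset "$1" --replicas="$2"
}
```

Calling `cold_restart mariadb-galera 3` and then following the logs of mariadb-galera-0 should show whether the first node comes back up or aborts as in the log above.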

Expected behavior

I should be able to cold restart a gracefully shut down cluster, or re-apply the initial configuration to a manually repaired cluster.

Additional context

rafariossaa commented 2 years ago

Hi @helletheone, if @cwrau doesn't respond, please open a new issue.

cwrau commented 2 years ago

> @cwrau is there a solution for your problem? Because I have the same problem now

Sadly not. We currently sometimes have to force bootstrapping on the last node to get it working again 😕
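
Before forcing a bootstrap, it helps to check which node Galera considers safe to start from. Galera records this in grastate.dat in the data directory (here presumably /bitnami/mariadb/data/, going by the base_dir in the log above). The following is a minimal sketch of reading that file; the helper name grastate_field is hypothetical.

```shell
# Print one field (e.g. seqno or safe_to_bootstrap) from a Galera grastate.dat.
# $1 = path to grastate.dat, $2 = field name
grastate_field() {
  # Lines look like "seqno:   -1"; split on the colon plus any spaces after it.
  awk -v key="$2" -F': *' '$1 == key { print $2; exit }' "$1"
}
```

The node to bootstrap from is normally the one with safe_to_bootstrap set to 1 or, after an unclean shutdown, the one with the highest seqno. The chart's README documents galera.bootstrap.bootstrapFromNode and galera.bootstrap.forceSafeToBootstrap for this purpose; check that your chart version supports them before relying on them.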