Can't recover from unexpected node crash unless i restart all cluster

m4hmudftw commented 10 months ago

I'm testing cluster before using it in production.

This is docker-compose context with 5 services :

load-balancer (haproxy),
new_seed_node (that is downed and replaced by node-01 after cluster first sync)
node-01
node-02
node-03

I set up everything up and got : CLUSTER STATUS CHANGED: --status synced, replication works great. However, when i shutdown a random node like node-02 and try to restart it i got this :

node-02           | ===|mysqld.sh|===: Attempting to recover GTID positon...
node-02           | 2023-12-08 14:55:29 0 [Note] WSREP: Recovered position: 2e47e6a9-95d7-11ee-b17b-8b8cf91ae20c:6
node-02           | ===|mysqld.sh|===: --------------------------------------------------
node-02           | ===|mysqld.sh|===: Found view from gvwstate.dat with (3) members: 3 8ada9a75-95d8-11ee-ac7d-4ac19728ffcf 2
node-02           | ===|mysqld.sh|===: Collecting grastate.dat and gvwstate.dat info from other nodes...
node-02           | 2023/12/08 14:58:02 socat[981] E connect(5, AF=2 node-01:3309, 16): Connection refused
node-02           | 2023/12/08 14:58:02 socat[983] E connect(5, AF=2 node-03:3309, 16): Connection refused
node-02           | 2023/12/08 14:58:07 socat[1007] E connect(5, AF=2 node-01:3309, 16): Connection refused
node-02           | 2023/12/08 14:58:07 socat[1009] E connect(5, AF=2 node-03:3309, 16): Connection refused
[...]
[...]
[...] 
node-02           | ===|mysqld.sh|===: Could not communicate with at least 1 other nodes and no nodes are up... Giving up!
node-02           | ===|mysqld.sh|===: Refusing to start since something is seriously wrong..
node-02           | ===|mysqld.sh|===: Touch /var/lib/mysql/new-cluster to force a node to start a new cluster.
node-02           | ===|mysqld.sh|===:
node-02           | ===|mysqld.sh|===:       VvVvVv
node-02           | ===|mysqld.sh|===:       |-  -|    //
node-02           | ===|mysqld.sh|===:  <----|O  O|---<<<
node-02           | ===|mysqld.sh|===:       |  D |    \\
node-02           | ===|mysqld.sh|===:       | () |
node-02           | ===|mysqld.sh|===:        \__/
node-02           | ===|mysqld.sh|===:
node-02           | MariaDB exited with return code (0)
node-02           | # GALERA saved state
node-02           | version: 2.1
node-02           | uuid:    2e47e6a9-95d7-11ee-b17b-8b8cf91ae20c
node-02           | seqno:   -1
node-02           | safe_to_bootstrap: 0
node-02           | Goodbye
node-02 exited with code 0

On other nodes :

node-01           | 2023-12-08 14:58:02 42 [Warning] Access denied for user 'system'@'localhost' (using password: YES)
node-01           | 2023-12-08 14:58:07 43 [Warning] Access denied for user 'system'@'localhost' (using password: YES)

node-03           | 2023-12-08 14:58:02 13 [Warning] Access denied for user 'system'@'localhost' (using password: YES)
node-03           | 2023-12-08 14:58:07 14 [Warning] Access denied for user 'system'@'localhost' (using password: YES)

The only working solution actually is to shutdown all the nodes in the cluster and restart everything but this is not acceptable in a production context.

This what official documentation says about crash recovery :

One Node Disappears from the Cluster

This is the case when one node becomes unavailable due to, for example, power outage, hardware failure, kernel panic, mysqld crash or kill -9 on mysqld pid.

The two remaining nodes notice the connection to node 1 is down and start trying to re-connect to it. After several timeouts, node 1 is removed from the cluster. The quorum is saved (two out of three nodes are up), so no service disruption happens. After it is restarted, node 1 joins automatically, as described in Node 1 Is Gracefully Stopped.

colinmollenhour commented 10 months ago

mysqld.sh launches before MariaDb launches to communicate with other nodes to determine how to go about crash recovery. It checks for health nodes using the http server listening on port 8081 but that apparently didn't return any positive results so it went to the socat check which should fail since the other nodes are not in the boot phase. The command it is running is: curl -f -s -o - http://node-01:8081 and curl -f -s -o - http://node-03:8081. I don't know why these would fail if the other two nodes are up.. I could see there being a very brief period while the other two nodes re-sync the state but then if the node you shut down fails it should keep restarting and succeed by the second time at least.. Please investigate this healthcheck on port 8081 and see why it is failing.

colinmollenhour commented 10 months ago

This is what the boot phase should look like in a healthy scenario. I just updated test.sh so it can easily be tested with just docker and ran a test:

...------======------... MariaDB Galera Start Script ...------======------...
Got NODE_ADDRESS=172.31.0.3
Found Servers: 172.31.0.2
Starting node, connecting to gcomm://172.31.0.2
Tailing /tmp/mysql-console/fifo...
===|mysqld.sh|===: Recovered position from grastate.dat: 9efdb2d0-965b-11ee-8c18-0f892eb91338:3
===|mysqld.sh|===: Safe to bootstrap: 0
Galera Cluster Node status: synced
===|mysqld.sh|===: Node at 172.31.0.2 is healthy!
===|mysqld.sh|===: Found a healthy node! Attempting to join...
===|mysqld.sh|===: ---------------------------------------------------------------
===|mysqld.sh|===: Starting with options: --console --wsrep_sst_auth=xtrabackup:foobar --wsrep-on=ON --wsrep-sst-method=mariabackup --wsrep_cluster_name=cluster --wsrep_cluster_address=gcomm://172.31.0.2 --wsrep_node_address=172.31.0.3:4567 --default-time-zone=+00:00 --log-bin=mysqld-bin --wsrep_start_position=9efdb2d0-965b-11ee-8c18-0f892eb91338:3
2023-12-09  6:29:04 0 [Note] Starting MariaDB 10.11.6-MariaDB-1:10.11.6+maria~ubu2204-log source revision fecd78b83785d5ae96f2c6ff340375be803cd299 as process 40

colinmollenhour commented 10 months ago

Actually to be more similar to your test I used docker kill -s KILL instead of docker stop so the node would not exit gracefully and this results in the state recovery like in your logs, but it still worked for me as expected.

...------======------... MariaDB Galera Start Script ...------======------...
Got NODE_ADDRESS=192.168.0.3
Found Servers: 192.168.0.2
Starting node, connecting to gcomm://192.168.0.2
Tailing /tmp/mysql-console/fifo...
===|mysqld.sh|===: uuid is known but seqno is not...
===|mysqld.sh|===: --------------------------------------------------
===|mysqld.sh|===: Attempting to recover GTID positon...
2023-12-09  6:37:35 0 [Note] WSREP: Recovered position: 3278c727-965d-11ee-a2c3-1776d8bc7ff2:3
===|mysqld.sh|===: --------------------------------------------------
Galera Cluster Node status: synced
===|mysqld.sh|===: Node at 192.168.0.2 is healthy!
===|mysqld.sh|===: Found a healthy node! Attempting to join...
===|mysqld.sh|===: ---------------------------------------------------------------
===|mysqld.sh|===: Starting with options: --console --wsrep_sst_auth=xtrabackup:foobar --wsrep-on=ON --wsrep-sst-method=mariabackup --wsrep_cluster_name=cluster --wsrep_cluster_address=gcomm://192.168.0.2 --wsrep_node_address=192.168.0.3:4567 --default-time-zone=+00:00 --log-bin=mysqld-bin --wsrep_start_position=3278c727-965d-11ee-a2c3-1776d8bc7ff2:3

m4hmudftw commented 9 months ago

Hey Colin, thank you for your time. I can confirm the previous log of node-01 and node-03

~ curl -vvvv node-01:8081
* About to connect() to node-01 port 8081 (#0)
*   Trying x.x.x.x ...
* Connected to node-01 (x.x.x.x) port 8081 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: node-01:8081
> Accept: */*
>
< HTTP/1.1 503 Service Unavailable
< Date: Mon, 11 Dec 2023 17:23:50 GMT
< Content-Length: 105
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host node-01 left intact
Galera Cluster Node status: Error 1045: Access denied for user 'system'@'localhost' (using password: YES)

~ curl -vvvv node-03:8081
* About to connect() to node-01 port 8081 (#0)
*   Trying x.x.x.x ...
* Connected to node-03 (x.x.x.x) port 8081 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: node-01:8081
> Accept: */*
>
< HTTP/1.1 503 Service Unavailable
< Date: Mon, 11 Dec 2023 17:23:50 GMT
< Content-Length: 105
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host node-03 left intact
Galera Cluster Node status: Error 1045: Access denied for user 'system'@'localhost' (using password: YES)

I'm actually trying to understand, i didn't specified any password for 'system' user.

MariaDB [(none)]> SELECT * FROM mysql.user WHERE user = 'system' and host = 'localhost'\G;
*************************** 1. row ***************************
Host: localhost
User: system
Password: *F7B9CS78FB4EF1E9F5AAC77432924A5C16AF565B3
Select_priv: N
Insert_priv: N
Update_priv: N
Delete_priv: N
Create_priv: N
Drop_priv: N
Reload_priv: N
Shutdown_priv: Y
Process_priv: Y
File_priv: N
Grant_priv: N
References_priv: N
Index_priv: N
Alter_priv: N
Show_db_priv: N
Super_priv: N
Create_tmp_table_priv: N
Lock_tables_priv: N
Execute_priv: N
Repl_slave_priv: N
Repl_client_priv: N
Create_view_priv: N
Show_view_priv: N
Create_routine_priv: N
Alter_routine_priv: N
Create_user_priv: N
Event_priv: N
Trigger_priv: N
Create_tablespace_priv: N
Delete_history_priv: N
ssl_type:
ssl_cipher:
x509_issuer:
x509_subject:
max_questions: 0
max_updates: 0
max_connections: 0
max_user_connections: 0
plugin: mysql_native_password
authentication_string: *F7B9CS78FB4EF1E9F5AAC77432924A5C16AF565B3
password_expired: N
is_role: N
default_role:
max_statement_time: 0.000000
1 row in set (0.06 sec)

colinmollenhour commented 9 months ago

It is created right here: https://github.com/colinmollenhour/mariadb-galera-swarm/blob/master/start.sh#L241C72-L241C72

Did you bring your own database directory? EDIT: It's based on the XTRABACKUP_PASSWORD so if you changed that you'd need to rerun the bootstrapping or start fresh.

m4hmudftw commented 9 months ago

Yes, each node has his own datadir mapped to /var/lib/mysql in container, new_seed_node & node-01 use the same dir.

I use /usr/local/lib/startup.sh script to export all passwords.

#!/bin/bash
passfile="/run/secrets/mysql_password.txt

while IFS= read -r line; do
      export "$line"`
done < "$passfile"

mysql_password.txt

MYSQL_PASSWORD=xxxxxxxxxxxx
MYSQL_ROOT_PASSWORD=xxxxxxxxxxxx
SST_MARIABACKUP_PASSWORD=xxxxxxxxxxxx
SYSTEM_PASSWORD=xxxxxxxxxxxx
XTRABACKUP_PASSWORD=xxxxxxxxxxxx

Also i can connect with system or xtrabackup user from container :

root@node-03:/usr/local/bin# mysql -u system -p -h localhost
Enter password:
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 54

This command return 0 root@node-03:/# galera-healthcheck -user=system -password="$SYSTEM_PASSWORD" -port=8081 -availWhenDonor=true -availWhenReadOnly=true -pidfile=/var/run/galera-healthcheck-2.pid

I still have the same error with curl test, i'm using SSL and require_secure_transport=on in my.cnf but it shouldn't be an issue.

colinmollenhour commented 9 months ago

The galera-healthcheck source is here but nothing special: https://github.com/sttts/galera-healthcheck/blob/master/server.go

I don't know why the healthcheck app can't connect - did you try mysql -u system -p -h 127.0.0.1 as well? Not sure which one galera-healthcheck uses - probably depends on whatever is default for go's mysql client.

m4hmudftw commented 9 months ago

It's actually working from container.

root@node-03:/usr/local/bin# mysql -u system -p -h 127.0.0.1
Enter password:
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 72
Server version: 10.11.5-MariaDB-1:10.11.5+maria~ubu2204-log mariadb.org binary distribution

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]>

colinmollenhour commented 9 months ago

Ahh, I didn't read thoroughly before.. Still stumped as to why it doesn't work but a fresh test does. I think it must be something with your server config.. You mentioned SSL, does your config require SSL connections from all clients?

m4hmudftw commented 9 months ago

Yes, SSL is needed in my config, is there a way to adapt the test ?

EDIT : I tried to set require_secure_transport=OFF and now everything works perfectly, i have to find a way to force SSL from client-side or adapt your script, i tried to recompile galera-healthcheck with ?tls=true parameter in func main() of server.go but it still fail.

db, _ := sql.Open("mysql", fmt.Sprintf("%s:%s@/?tls=true", *mysqlUser, *mysqlPassword))

colinmollenhour commented 9 months ago

I suggest only requiring SSL for specific users rather than the entire server. Any other solution is outside the scope of what I personally have the willingness to support on this project.

colinmollenhour / mariadb-galera-swarm

Can't recover from unexpected node crash unless i restart all cluster #113