Closed m4hmudftw closed 9 months ago
mysqld.sh launches before MariaDb launches to communicate with other nodes to determine how to go about crash recovery. It checks for health nodes using the http server listening on port 8081 but that apparently didn't return any positive results so it went to the socat check which should fail since the other nodes are not in the boot phase. The command it is running is:
curl -f -s -o - http://node-01:8081
and curl -f -s -o - http://node-03:8081
. I don't know why these would fail if the other two nodes are up.. I could see there being a very brief period while the other two nodes re-sync the state but then if the node you shut down fails it should keep restarting and succeed by the second time at least.. Please investigate this healthcheck on port 8081 and see why it is failing.
This is what the boot phase should look like in a healthy scenario. I just updated test.sh
so it can easily be tested with just docker and ran a test:
...------======------... MariaDB Galera Start Script ...------======------...
Got NODE_ADDRESS=172.31.0.3
Found Servers: 172.31.0.2
Starting node, connecting to gcomm://172.31.0.2
Tailing /tmp/mysql-console/fifo...
===|mysqld.sh|===: Recovered position from grastate.dat: 9efdb2d0-965b-11ee-8c18-0f892eb91338:3
===|mysqld.sh|===: Safe to bootstrap: 0
Galera Cluster Node status: synced
===|mysqld.sh|===: Node at 172.31.0.2 is healthy!
===|mysqld.sh|===: Found a healthy node! Attempting to join...
===|mysqld.sh|===: ---------------------------------------------------------------
===|mysqld.sh|===: Starting with options: --console --wsrep_sst_auth=xtrabackup:foobar --wsrep-on=ON --wsrep-sst-method=mariabackup --wsrep_cluster_name=cluster --wsrep_cluster_address=gcomm://172.31.0.2 --wsrep_node_address=172.31.0.3:4567 --default-time-zone=+00:00 --log-bin=mysqld-bin --wsrep_start_position=9efdb2d0-965b-11ee-8c18-0f892eb91338:3
2023-12-09 6:29:04 0 [Note] Starting MariaDB 10.11.6-MariaDB-1:10.11.6+maria~ubu2204-log source revision fecd78b83785d5ae96f2c6ff340375be803cd299 as process 40
Actually to be more similar to your test I used docker kill -s KILL
instead of docker stop
so the node would not exit gracefully and this results in the state recovery like in your logs, but it still worked for me as expected.
...------======------... MariaDB Galera Start Script ...------======------...
Got NODE_ADDRESS=192.168.0.3
Found Servers: 192.168.0.2
Starting node, connecting to gcomm://192.168.0.2
Tailing /tmp/mysql-console/fifo...
===|mysqld.sh|===: uuid is known but seqno is not...
===|mysqld.sh|===: --------------------------------------------------
===|mysqld.sh|===: Attempting to recover GTID positon...
2023-12-09 6:37:35 0 [Note] WSREP: Recovered position: 3278c727-965d-11ee-a2c3-1776d8bc7ff2:3
===|mysqld.sh|===: --------------------------------------------------
Galera Cluster Node status: synced
===|mysqld.sh|===: Node at 192.168.0.2 is healthy!
===|mysqld.sh|===: Found a healthy node! Attempting to join...
===|mysqld.sh|===: ---------------------------------------------------------------
===|mysqld.sh|===: Starting with options: --console --wsrep_sst_auth=xtrabackup:foobar --wsrep-on=ON --wsrep-sst-method=mariabackup --wsrep_cluster_name=cluster --wsrep_cluster_address=gcomm://192.168.0.2 --wsrep_node_address=192.168.0.3:4567 --default-time-zone=+00:00 --log-bin=mysqld-bin --wsrep_start_position=3278c727-965d-11ee-a2c3-1776d8bc7ff2:3
Hey Colin, thank you for your time. I can confirm the previous log of node-01
and node-03
~ curl -vvvv node-01:8081
* About to connect() to node-01 port 8081 (#0)
* Trying x.x.x.x ...
* Connected to node-01 (x.x.x.x) port 8081 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: node-01:8081
> Accept: */*
>
< HTTP/1.1 503 Service Unavailable
< Date: Mon, 11 Dec 2023 17:23:50 GMT
< Content-Length: 105
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host node-01 left intact
Galera Cluster Node status: Error 1045: Access denied for user 'system'@'localhost' (using password: YES)
~ curl -vvvv node-03:8081
* About to connect() to node-01 port 8081 (#0)
* Trying x.x.x.x ...
* Connected to node-03 (x.x.x.x) port 8081 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: node-01:8081
> Accept: */*
>
< HTTP/1.1 503 Service Unavailable
< Date: Mon, 11 Dec 2023 17:23:50 GMT
< Content-Length: 105
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host node-03 left intact
Galera Cluster Node status: Error 1045: Access denied for user 'system'@'localhost' (using password: YES)
I'm actually trying to understand, i didn't specified any password for 'system' user.
MariaDB [(none)]> SELECT * FROM mysql.user WHERE user = 'system' and host = 'localhost'\G;
*************************** 1. row ***************************
Host: localhost
User: system
Password: *F7B9CS78FB4EF1E9F5AAC77432924A5C16AF565B3
Select_priv: N
Insert_priv: N
Update_priv: N
Delete_priv: N
Create_priv: N
Drop_priv: N
Reload_priv: N
Shutdown_priv: Y
Process_priv: Y
File_priv: N
Grant_priv: N
References_priv: N
Index_priv: N
Alter_priv: N
Show_db_priv: N
Super_priv: N
Create_tmp_table_priv: N
Lock_tables_priv: N
Execute_priv: N
Repl_slave_priv: N
Repl_client_priv: N
Create_view_priv: N
Show_view_priv: N
Create_routine_priv: N
Alter_routine_priv: N
Create_user_priv: N
Event_priv: N
Trigger_priv: N
Create_tablespace_priv: N
Delete_history_priv: N
ssl_type:
ssl_cipher:
x509_issuer:
x509_subject:
max_questions: 0
max_updates: 0
max_connections: 0
max_user_connections: 0
plugin: mysql_native_password
authentication_string: *F7B9CS78FB4EF1E9F5AAC77432924A5C16AF565B3
password_expired: N
is_role: N
default_role:
max_statement_time: 0.000000
1 row in set (0.06 sec)
It is created right here: https://github.com/colinmollenhour/mariadb-galera-swarm/blob/master/start.sh#L241C72-L241C72
Did you bring your own database directory? EDIT: It's based on the XTRABACKUP_PASSWORD so if you changed that you'd need to rerun the bootstrapping or start fresh.
Yes, each node has his own datadir mapped to /var/lib/mysql
in container, new_seed_node & node-01 use the same dir.
I use /usr/local/lib/startup.sh
script to export all passwords.
#!/bin/bash
passfile="/run/secrets/mysql_password.txt
while IFS= read -r line; do
export "$line"`
done < "$passfile"
mysql_password.txt
MYSQL_PASSWORD=xxxxxxxxxxxx
MYSQL_ROOT_PASSWORD=xxxxxxxxxxxx
SST_MARIABACKUP_PASSWORD=xxxxxxxxxxxx
SYSTEM_PASSWORD=xxxxxxxxxxxx
XTRABACKUP_PASSWORD=xxxxxxxxxxxx
Also i can connect with system
or xtrabackup
user from container :
root@node-03:/usr/local/bin# mysql -u system -p -h localhost
Enter password:
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 54
This command return 0
root@node-03:/# galera-healthcheck -user=system -password="$SYSTEM_PASSWORD" -port=8081 -availWhenDonor=true -availWhenReadOnly=true -pidfile=/var/run/galera-healthcheck-2.pid
I still have the same error with curl test, i'm using SSL and require_secure_transport=on
in my.cnf
but it shouldn't be an issue.
The galera-healthcheck source is here but nothing special: https://github.com/sttts/galera-healthcheck/blob/master/server.go
I don't know why the healthcheck app can't connect - did you try mysql -u system -p -h 127.0.0.1
as well? Not sure which one galera-healthcheck uses - probably depends on whatever is default for go's mysql client.
It's actually working from container.
root@node-03:/usr/local/bin# mysql -u system -p -h 127.0.0.1
Enter password:
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 72
Server version: 10.11.5-MariaDB-1:10.11.5+maria~ubu2204-log mariadb.org binary distribution
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MariaDB [(none)]>
Ahh, I didn't read thoroughly before.. Still stumped as to why it doesn't work but a fresh test does. I think it must be something with your server config.. You mentioned SSL, does your config require SSL connections from all clients?
Yes, SSL is needed in my config, is there a way to adapt the test ?
EDIT : I tried to set require_secure_transport=OFF
and now everything works perfectly, i have to find a way to force SSL from client-side or adapt your script, i tried to recompile galera-healthcheck
with ?tls=true
parameter in func main()
of server.go
but it still fail.
db, _ := sql.Open("mysql", fmt.Sprintf("%s:%s@/?tls=true", *mysqlUser, *mysqlPassword))
I suggest only requiring SSL for specific users rather than the entire server. Any other solution is outside the scope of what I personally have the willingness to support on this project.
I'm testing cluster before using it in production.
This is docker-compose context with 5 services :
I set up everything up and got :
CLUSTER STATUS CHANGED: --status synced
, replication works great. However, when i shutdown a random node like node-02 and try to restart it i got this :On other nodes :
The only working solution actually is to shutdown all the nodes in the cluster and restart everything but this is not acceptable in a production context.
This what official documentation says about crash recovery :