Nodes no longer starting after restoring a backup

RoyvanEmpel commented 11 months ago

Hello,

I want to set up a Galera cluster using Galera Manager. I have set up a cluster with 3 servers and they are all in sync. Now I want to restore a mariabackup backup from my current MariaDB server. How should I go about this?

I tried the following steps:

Shut down all the nodes.
Ran mariabackup --copy-back --target-dir=/path/to/backupdir on the first node.
Recovered the cluster.
Saw that it didn't work.
Deleted all the nodes.
Added node 1.
Ran the mariabackup command again.
Started node 1 :)
Added node 2: error.
Syncing doesn't happen and node 1 keeps switching between "Donor" and "Synced".
Removed node 2.
Re-added node 1.
Node 1 doesn't start.
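
For reference, the restore sequence described in the MariaDB documentation has two steps that the list above does not mention: preparing the backup before copying it back, and fixing file ownership afterwards. The following is a minimal sketch, not a verified fix for this cluster. It assumes the default data directory /var/lib/mysql, a service named mariadb, and a server running as mysql:mysql; `run` and `restore_backup` are hypothetical helper names. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch of a mariabackup restore on the first node.
# Assumptions (not from the thread): datadir is /var/lib/mysql, the service
# is named mariadb, and the server runs as mysql:mysql.

DRY_RUN="${DRY_RUN:-1}"   # default: print commands only

run() {
    # Print the command; execute it only when DRY_RUN=0.
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

restore_backup() {
    local backup_dir="$1"
    local datadir="${2:-/var/lib/mysql}"

    run systemctl stop mariadb
    # --copy-back expects an empty data directory.
    run find "$datadir" -mindepth 1 -delete
    # Apply the redo log so the backup is consistent before copying.
    run mariabackup --prepare --target-dir="$backup_dir"
    run mariabackup --copy-back --target-dir="$backup_dir"
    # Files are written as the invoking user; hand them to the server.
    run chown -R mysql:mysql "$datadir"
}
```

Calling `restore_backup /path/to/backupdir` prints the five commands for review; setting `DRY_RUN=0` executes them, which is destructive because it empties the data directory.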

Since I had made backups of my /var/lib/mysql folder, I restored that with the right permissions, but MariaDB still doesn't start. I tried removing /var/lib/mysql and /etc/mysql, but that doesn't fix anything.

I am now stuck: nodes 1 and 2 no longer start, failing with the following error:

Aug 18, 2023 00:31:50 | galera-manager | checking node status
Aug 18, 2023 00:31:50 | galera-manager | root@10.0.1.5# mysqladmin -u root status
Aug 18, 2023 00:31:50 | stdout         | mysqladmin: connect to server at 'localhost' failed
Aug 18, 2023 00:31:50 | stdout         | error: 'Can't connect to local server through socket '/run/mysqld/mysqld.sock' (2)'
Aug 18, 2023 00:31:50 | stdout         | Check that mariadbd is running and that the socket: '/run/mysqld/mysqld.sock' exists!
Aug 18, 2023 00:31:50 | galera-manager | mariadb is apparently not running
Aug 18, 2023 00:31:50 | galera-manager | setting cluster-wide Lock (to avoid race conditions with the first node)
Aug 18, 2023 00:31:50 | galera-manager | starting as a first node
Aug 18, 2023 00:31:50 | galera-manager | checking grastate
Aug 18, 2023 00:31:51 | galera-manager | root@10.0.1.5# cat /var/lib/mysql/grastate.dat
Aug 18, 2023 00:31:51 | stdout         | # GALERA saved state
Aug 18, 2023 00:31:51 | stdout         | version: 2.1
Aug 18, 2023 00:31:51 | stdout         | uuid:    62be2c5d-3d42-11e6-ac82-0e98101eb438
Aug 18, 2023 00:31:51 | stdout         | seqno:   69
Aug 18, 2023 00:31:51 | stdout         | safe_to_bootstrap: 1
Aug 18, 2023 00:31:51 | galera-manager | running start script
Aug 18, 2023 00:31:51 | galera-manager | root@10.0.1.5# echo -n "test"
Aug 18, 2023 00:31:51 | stdout         | test
Aug 18, 2023 00:31:51 | galera-manager | Default Galera version is 4
Aug 18, 2023 00:31:51 | galera-manager | Including custom config directory from my.cnf
Aug 18, 2023 00:31:51 | stdout         | 10.0.1.5:22$ bash -c '[ -f /var/lib/mysql/grastate.dat ] && sed -i '"'"'s/safe_to_bootstrap: .*/safe_to_bootstrap: 1/'"'"' /var/lib/mysql/grastate.dat || true'
Aug 18, 2023 00:31:51 | galera-manager | Will fix grastate.dat (if required)
Aug 18, 2023 00:31:51 | galera-manager | Running the first node in the cluster
Aug 18, 2023 00:31:51 | stdout         | 10.0.1.5:22$ galera_new_cluster
Aug 18, 2023 00:31:52 | stdout         | Job for mariadb.service failed because the control process exited with error code.
Aug 18, 2023 00:31:52 | stdout         | See "systemctl status mariadb.service" and "journalctl -xe" for details.
Aug 18, 2023 00:31:52 | galera-manager | Got an error and attepts = 0
Aug 18, 2023 00:31:52 | galera-manager | SshHost.RunScript error: command failed (stepName=__step_no_001, commandId=1, commandType=IncludeCommand): command failed (stepName=run_cluster_first, commandId=3, commandType=ExecCommand): Process exited with status 1failed to execute cluster config script (RunScriptWithConn)
github.com/codership/galera-manager/pkg/internal/sshcmd.(*Host).RunScriptWithConn
    /go/pkg/internal/sshcmd/executor.go:115
github.com/codership/galera-manager/pkg/internal/sshcmd.(*Host).RunScript
    /go/pkg/internal/sshcmd/executor.go:171
github.com/codership/galera-manager/pkg/internal/mgmt/units.(*Node).Start
    /go/pkg/internal/mgmt/units/node.go:483
github.com/codership/galera-manager/pkg/internal/mgmt.(*Nodes).Start.func1
    /go/pkg/internal/mgmt/nodes.go:180
github.com/codership/galera-manager/pkg/internal/jobs.(*Processor).Execute.func1
    /go/pkg/internal/jobs/processor.go:90
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1594
Aug 18, 2023 00:31:52 | galera-manager | Exit status is not 0. Database engine start failure?
Aug 18, 2023 00:31:52 | galera-manager | error starting the node
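
In the log above, Galera Manager reads /var/lib/mysql/grastate.dat and forces safe_to_bootstrap to 1 before calling galera_new_cluster. The same fields can be inspected by hand. This is a minimal sketch; `grastate_field` and `can_bootstrap` are hypothetical helper names, and the default path is the one shown in the log.

```shell
#!/usr/bin/env bash
# Read one field from a Galera grastate.dat file.
# Usage: grastate_field <field> [file]
grastate_field() {
    local field="$1"
    local file="${2:-/var/lib/mysql/grastate.dat}"
    # Lines look like "seqno:   69"; skip comments, match the key exactly.
    awk -F':[ \t]*' -v key="$field" \
        '$0 !~ /^#/ && $1 == key { print $2; exit }' "$file"
}

# Succeed only if the node is marked as safe to bootstrap from.
can_bootstrap() {
    [ "$(grastate_field safe_to_bootstrap "$1")" = "1" ]
}
```

For the file shown in the log, `grastate_field seqno` would print 69 and `can_bootstrap` would succeed, which suggests that the saved state itself is not what prevents the node from starting.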

RoyvanEmpel commented 11 months ago

It seems that after re-adding the node some config files are missing?

# systemctl status mariadb.service
● mariadb.service - MariaDB 10.6.15 database server
     Loaded: loaded (/lib/systemd/system/mariadb.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/mariadb.service.d
             └─migrated-from-my.cnf-settings.conf
     Active: failed (Result: exit-code) since Fri 2023-08-18 00:31:52 CEST; 7min ago
       Docs: man:mariadbd(8)
             https://mariadb.com/kb/en/library/systemd/
    Process: 51186 ExecStartPre=/usr/bin/install -m 755 -o mysql -g root -d /var/run/mysqld (code=exited, status=0/SUCCESS)
    Process: 51187 ExecStartPre=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)
    Process: 51189 ExecStartPre=/bin/sh -c [ ! -e /usr/bin/galera_recovery ] && VAR= ||   VAR=`cd /usr/bin/..; /usr/bin/galera_recovery`; [ $? -eq 0 ]   && systemctl set-environment _WSREP_START_POSITION=$VAR || exit 1 (code=exited, status=0/SUCCESS)
    Process: 51290 ExecStart=/usr/sbin/mariadbd $MYSQLD_OPTS $_WSREP_NEW_CLUSTER $_WSREP_START_POSITION (code=exited, status=0/SUCCESS)
    Process: 51309 ExecStartPost=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)
    Process: 51311 ExecStartPost=/etc/mysql/debian-start (code=exited, status=203/EXEC)
   Main PID: 51290 (code=exited, status=0/SUCCESS)
     Status: "MariaDB server is down"

Aug 18 00:31:51 db-201 systemd[1]: Starting MariaDB 10.6.15 database server...
Aug 18 00:31:51 db-201 sh[51190]: WSREP: Recovered position 62be2c5d-3d42-11e6-ac82-0e98101eb438:69
Aug 18 00:31:52 db-201 systemd[51311]: mariadb.service: Failed to execute command: No such file or directory
Aug 18 00:31:52 db-201 systemd[51311]: mariadb.service: Failed at step EXEC spawning /etc/mysql/debian-start: No such file or directory
Aug 18 00:31:52 db-201 systemd[1]: mariadb.service: Control process exited, code=exited, status=203/EXEC
Aug 18 00:31:52 db-201 systemd[1]: mariadb.service: Failed with result 'exit-code'.
Aug 18 00:31:52 db-201 systemd[1]: Failed to start MariaDB 10.6.15 database server.
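
The status output above shows mariadbd itself exiting with status 0, while ExecStartPost=/etc/mysql/debian-start fails with 203/EXEC because the file does not exist. That is consistent with /etc/mysql having been removed earlier. A minimal sketch for spotting this kind of problem follows; `missing_exec_paths` is a hypothetical helper that lists programs named in a unit file's Exec lines that are missing or not executable. It only looks at absolute paths and ignores drop-in overrides unless they are part of the file passed in.

```shell
#!/usr/bin/env bash
# List programs referenced by ExecStartPre/ExecStart/ExecStartPost lines
# that do not exist or are not executable.
# Usage: missing_exec_paths <unit-file>
missing_exec_paths() {
    local unit_file="$1" line cmd
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            ExecStartPre=*|ExecStart=*|ExecStartPost=*) ;;
            *) continue ;;
        esac
        cmd="${line#*=}"     # drop the "ExecStartPost=" style key
        cmd="${cmd%% *}"     # keep only the first word (the program)
        # systemd allows prefix characters such as "-" or "@" before the
        # path; strip everything up to the first "/". Names without a
        # leading "/" end up empty and are skipped.
        while [ -n "$cmd" ] && [ "${cmd#/}" = "$cmd" ]; do
            cmd="${cmd#?}"
        done
        [ -z "$cmd" ] || [ -x "$cmd" ] || printf '%s\n' "$cmd"
    done < "$unit_file"
}
```

One way to use it is to dump the effective unit with `systemctl cat mariadb.service > /tmp/mariadb.unit` and then run `missing_exec_paths /tmp/mariadb.unit`.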

RoyvanEmpel commented 11 months ago

After A LOT of problems I finally got the nodes back up and running. I deleted all the leftover configs, files, and services from the nodes that Galera Manager didn't delete. Now that all the nodes are up and running again, Galera Manager can monitor them, but I cannot add new nodes using the UI.

Since I had to use the galera_new_cluster command, Galera Manager no longer sees it as the same cluster and can't do anything with it. Is there a UUID somewhere that I can edit so Galera Manager is able to pick up the new cluster?

RoyvanEmpel commented 11 months ago

I fixed my issues by starting the new Galera cluster and using rsync to sync all the nodes. Once it was back up, I shut down all the servers and recovered the cluster using Galera Manager.

These are my steps for fixing my initial issue:

Stop all servers.
Delete the nodes from Galera Manager.
Delete the contents of /var/lib/mysql.
Delete the contents of /etc/mysql.
Delete the service if it hasn't been removed automatically:
- systemctl stop mariadb
- systemctl disable mariadb
- rm /etc/systemd/system/mariadb.service
- rm -r /etc/systemd/system/mariadb.service.d (folder)
- rm /usr/lib/systemd/system/mariadb.service
- rm -r /usr/lib/systemd/system/mariadb.service.d (folder)
- systemctl daemon-reload
- systemctl reset-failed
Add node 1 to the Galera cluster.
Check for errors with systemctl status mariadb.
Add the remaining nodes.
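
The per-node cleanup in the steps above can be sketched as a small script. This is a minimal sketch under stated assumptions, not a tested procedure: the unit file and drop-in directory are assumed to be named mariadb.service and mariadb.service.d, as reported by systemctl status earlier in the thread, and `run` and `cleanup_node` are hypothetical helper names. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the per-node cleanup.
# Assumption: unit names are mariadb.service / mariadb.service.d.

DRY_RUN="${DRY_RUN:-1}"   # default: print commands only

run() {
    # Print the command; execute it only when DRY_RUN=0.
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

cleanup_node() {
    # Stop and disable the service before touching its files.
    run systemctl stop mariadb
    run systemctl disable mariadb
    # Empty the data and config directories but keep the directories.
    run find /var/lib/mysql /etc/mysql -mindepth 1 -delete
    # Remove leftover unit files and drop-in directories, if present.
    run rm -f /etc/systemd/system/mariadb.service /usr/lib/systemd/system/mariadb.service
    run rm -rf /etc/systemd/system/mariadb.service.d /usr/lib/systemd/system/mariadb.service.d
    # Make systemd forget the removed unit and its failed state.
    run systemctl daemon-reload
    run systemctl reset-failed
}
```

Running `cleanup_node` prints the seven commands for review. Setting `DRY_RUN=0` executes them and irreversibly deletes the node's data and configuration, so it should only be run on a node that is going to be re-added from scratch.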
