codership / galera

Synchronous multi-master replication library
GNU General Public License v2.0

Mariabackup SST Failure: "... cannot be performed on a running server..." #569

Closed lukeescude closed 4 years ago

lukeescude commented 4 years ago

Hello!

We are using several MariaDB 10.2 servers in a Galera cluster. This system has its quirks but it has overall been fantastic to work with.

After I upgraded to using Mariabackup as the IST/SST method, things have worked smoothly, but I'm noticing this error:

"You have configured 'mariabackup' state snapshot transfer method which cannot be performed on a running server. Wsrep provider won't be able to fall back to it if other means of state transfer are unavailable. In that case you will need to restart the server."

Despite being a long-worded error, it is still unclear as to what's wrong and what needs to be done.

I believe the wsrep_sst_method setting is Dynamic, but is it possible a MySQL restart is really required to make sure it works?

We recently had a couple connectivity issues and several nodes came back up using IST successfully, but twice now (two different nodes) complained that the operation was cancelled (on the donor) because the "SST Request was Null". I am not sure if this is related.

Thanks!

claudionanni commented 4 years ago

Luke,

On Tue, Feb 25, 2020, 21:57 Luke Escudé notifications@github.com wrote:

Hello!

We are using several MariaDB 10.2 servers in a Galera cluster. This system has its quirks but it has overall been fantastic to work with.

After I upgraded to using Mariabackup as the IST/SST method, things have worked smoothly, but I'm noticing this error:

"You have configured 'mariabackup' state snapshot transfer method which cannot be performed on a running server. Wsrep provider won't be able to fall back to it if other means of state transfer are unavailable. In that case you will need to restart the server."

Despite being a long-worded error, it is still unclear as to what's wrong and what needs to be done.

  • What constitutes a "running server?"
  • Does this indicate I need to "restart the server" every time an SST is required?
  • How do I permanently fix it?

I believe the wsrep_sst_method setting is Dynamic, but is it possible a MySQL restart is really required to make sure it works?

SST is basically a backup (from the Donor) and restore (on the Joiner), needed when the node is new or has fallen too far behind the cluster.

SST methods, like backups, can be binary (datadir copy) or logical (SQL dump).

You can only restore a binary backup on a stopped node, while you can only restore a logical backup on a running node.

Q.1: A 'running server' is just a running node/instance.

Q.2: In 99% of cases, SST is needed when the node is already stopped and is being restarted.

Only in very rare cases is an SST needed on a running server; usually IST is enough. IST is a partial synchronization via a transaction cache.

In such a rare case a binary SST method won't work, because you cannot restore a datadir on a running server, so the node will stop.

Logical SST methods (mysqldump) are not used in practice (slow restore).

Q.3: There is nothing to fix.

For when SST is needed and how it differs from IST, please refer to the documentation.
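To make the rules above concrete, here is a minimal Python sketch of the decision as described in this reply. It is only a model of the explanation, not how the wsrep provider is implemented. The names `Transfer`, `BINARY_METHODS`, `LOGICAL_METHODS` and `plan_state_transfer` are hypothetical, and the method sets contain only the two methods mentioned in this thread.

```python
from enum import Enum


class Transfer(Enum):
    IST = "incremental state transfer"
    BINARY_SST = "binary SST (datadir copy)"
    LOGICAL_SST = "logical SST (SQL dump)"
    RESTART_NEEDED = "node must be restarted before a binary SST"


# Hypothetical grouping, limited to the methods named in this thread.
BINARY_METHODS = {"mariabackup"}
LOGICAL_METHODS = {"mysqldump"}


def plan_state_transfer(sst_method: str, ist_possible: bool,
                        server_running: bool) -> Transfer:
    """Model which kind of state transfer a joining node would end up with."""
    # IST is preferred whenever the donor's transaction cache still covers
    # the gap, regardless of the configured SST method.
    if ist_possible:
        return Transfer.IST
    if sst_method in BINARY_METHODS:
        # A datadir cannot be restored underneath a running server,
        # which is the situation the warning message describes.
        if server_running:
            return Transfer.RESTART_NEEDED
        return Transfer.BINARY_SST
    if sst_method in LOGICAL_METHODS:
        # A SQL dump is applied through a running server.
        return Transfer.LOGICAL_SST
    raise ValueError(f"unknown SST method: {sst_method!r}")


if __name__ == "__main__":
    # The rare case from the warning: running node, IST not possible.
    print(plan_state_transfer("mariabackup", False, True).value)
```

In this model the warning only matters in the `RESTART_NEEDED` branch; the common paths (IST on a running node, binary SST on a node that is starting up) are unaffected.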

Best Regards Claudio

lukeescude commented 4 years ago

Claudio,

Thank you for the quick response!

So, then I think my actual question is: Why would an SST with Mariabackup fail with an error that says "Operation cancelled" because the "request is Null"?

claudionanni commented 4 years ago

Hello Luke,

On Tue, Feb 25, 2020, 22:54 Luke Escudé notifications@github.com wrote:

Claudio,

Thank you for the quick response!

So, then I think my actual question is: Why would an SST with Mariabackup fail with an error that says "Operation cancelled" because the "request is Null"?

Hard to say without more details.

It could be a configuration error, missing packages, SELinux, and so on.

You should also check the log on the Donor. SST triggers a script on both the Joiner and the Donor, and each can fail for different reasons.
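As a rough aid for comparing the two logs, a small Python helper along these lines can pull out the lines that mention state transfer. This is only a sketch: the function name `find_sst_lines` and the keyword list are assumptions, and the log location depends on your setup, so the path is passed in rather than hard-coded.

```python
import sys
from typing import Iterable, List, Tuple

# Assumed keywords; adjust to whatever your error log actually contains.
KEYWORDS = ("sst", "wsrep", "state transfer")


def find_sst_lines(lines: Iterable[str],
                   keywords: Tuple[str, ...] = KEYWORDS) -> List[Tuple[int, str]]:
    """Return (line_number, text) for lines mentioning any keyword."""
    matches = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        if any(keyword in lowered for keyword in keywords):
            matches.append((number, line.rstrip("\n")))
    return matches


if __name__ == "__main__":
    # Usage: python find_sst.py <path-to-error-log>
    # Run it once on the Donor and once on the Joiner, then compare.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for number, text in find_sst_lines(handle):
            print(f"{number}: {text}")
```

Lining up the output from both nodes by timestamp usually shows which side gave up first.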

Best Regards Claudio

lukeescude commented 4 years ago

So, it turns out restarting all nodes (one at a time of course) fixed the issues with SST/IST not completing properly (with the Null result issue).

Also, using systemctl has given me more control over service startup time, max open files, logging, etc. so all our nodes are happy again!