adfinis / openshift-mariadb-galera

Kubernetes / OpenShift Images for a MariaDB Galera Cluster

Recover from power failure in DC #19

Open lewismarshall opened 6 years ago

lewismarshall commented 6 years ago

If all nodes are ever stopped, e.g. by a power failure in a DC (an operationally tested scenario), the cluster fails to start.

It would seem as though the /var/lib/mysql/grastate.dat file never has suitable information for an automated recovery (as documented on galeracluster.com).

This results in the following symptom:

2018-05-25 13:29:33 140151997229312 [Warning] WSREP: no nodes coming from prim view, prim not possible

The nodes then fail to connect (the same error appears on all of them):

2018-05-25 13:30:03 140151997229312 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
     at gcomm/src/pc.cpp:connect():158
2018-05-25 13:30:03 140151997229312 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2018-05-25 13:30:03 140151997229312 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1458: Failed to open channel 'galera' at 'gcomm://mysql-0.galera.sysdig.svc.cluster.local,mysql-1.galera.sysdig.svc.cluster.local,mysql-2.galera.sysdig.svc.cluster.local': -110 (Connection timed out)
2018-05-25 13:30:03 140151997229312 [ERROR] WSREP: gcs connect failed: Connection timed out
2018-05-25 13:30:03 140151997229312 [ERROR] WSREP: wsrep::connect(gcomm://mysql-0.galera.sysdig.svc.cluster.local,mysql-1.galera.sysdig.svc.cluster.local,mysql-2.galera.sysdig.svc.cluster.local) failed: 7
2018-05-25 13:30:03 140151997229312 [ERROR] Aborting

All the grastate.dat files appear equivalent, so any node should/could potentially be bootstrapped:

$ for i in {0..2} ; do kubectl -n sysdig exec -it mysql-${i} -- cat /var/lib/mysql/grastate.dat ; done
# GALERA saved state
version: 2.1
uuid:    2abe687e-6011-11e8-97dd-c70a88466155
seqno:   -1
safe_to_bootstrap: 0
# GALERA saved state
version: 2.1
uuid:    2abe687e-6011-11e8-97dd-c70a88466155
seqno:   -1
safe_to_bootstrap: 0
# GALERA saved state
version: 2.1
uuid:    2abe687e-6011-11e8-97dd-c70a88466155
seqno:   -1
safe_to_bootstrap: 0
$

I'm a bit worried that the UUID is the same for all nodes...

See the complete logs here: mysql-2.log mysql-1.log mysql-0.log

tongpu commented 6 years ago

The UUID represents the cluster UUID at the time of the crash, so it must be the same on every node in the cluster. You would need to manually edit grastate.dat, like you would for any other Galera Cluster.

One issue I see is that our wrapper script only allows bootstrapping from the first node (mysql-0), so if the other nodes still accepted write queries before the crash, you would be losing that data. You would need to add wsrep_recover=1 (explained here) to the configuration to check the GTID on every node.

To get the cluster up and running, set safe_to_bootstrap: 1 in the grastate.dat on the PV of mysql-0, and the cluster should come up again.
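
For reference, a rough sketch of that manual step, assuming the pod names and namespace from the logs above (mysql-0 in the sysdig namespace) and that the container stays running long enough to exec into:

$ kubectl -n sysdig exec -it mysql-0 -- \
    sed -i 's/^safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /var/lib/mysql/grastate.dat
$ kubectl -n sysdig delete pod mysql-0    # let the StatefulSet recreate it and bootstrap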

lewismarshall commented 6 years ago

@tongpu thanks for the UUID note. I've recovered the cluster manually (too many times in a test cluster) as you described.

This issue is about the operationally preferable option of a self-healing cluster. It seems to be something Galera supports via grastate.dat, but that file always seems to contain -1 for the seqno. I'm adding some debugging and doing some more tests to see if this is always the case - the nodes do always seem to shut down cleanly when pods are deleted (or when physical nodes are shut down).

Looking at the various Galera docs here, it seems as though setting wsrep_provider_options="gcache.recover=yes" may be required, and the additional option pc.recovery=TRUE described here may also be needed (although it does look like a default). I'll do some testing...
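
For illustration, a minimal sketch of where those settings would live, assuming a standard [mysqld] section (the actual config path and layout in this image may differ):

[mysqld]
# recover the GCache on startup and let the primary component recover automatically
wsrep_provider_options="gcache.recover=yes;pc.recovery=TRUE"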

lewismarshall commented 6 years ago

After debugging, the grastate.dat typically has 0 for the sequence number, but if the pods are deleted, startup never gets past the first node. Maybe the health checks should be sensitive to recovery operations so that all nodes can come up. Will have a think over the weekend.
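
As a purely hypothetical sketch, a recovery-tolerant readiness probe in the StatefulSet might look something like this (the probe command and thresholds are assumptions, not the chart's current values):

readinessProbe:
  exec:
    # authentication flags omitted for brevity
    command: ["mysql", "-e", "SHOW STATUS LIKE 'wsrep_ready'"]
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 30    # tolerate a long primary-component recovery window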

tongpu commented 6 years ago

Normally a Galera Cluster automatically recovers from a power-loss scenario, but only if all the nodes come back up at nearly the same time; otherwise you have to find the most current one using wsrep_recover=1. Because of the way the StatefulSet API starts pods, that will probably never happen here, since we always try to bootstrap from the first node.
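
A sketch of that manual check, assuming mysqld can be invoked directly in each pod while the server is stopped:

$ mysqld --wsrep-recover
# look for "WSREP: Recovered position: <uuid>:<seqno>" in the output, then set
# safe_to_bootstrap: 1 in grastate.dat on the node with the highest seqno and
# bootstrap from that node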