Galera startup scenarios

Hi Guys,

This isn't really a bug or an issue, but I'm looking for help from people who might have more experience with and knowledge of Galera, and specifically this Docker implementation. Firstly, I have to say that this project is really good and I've enjoyed working with it so far. I will have some contributions to make once I've cleared this up and tidied up my branch a bit.

I've been playing around with the recovery sequence of a 3 node Galera cluster. The scenario I'm investigating is how it recovers following a complete power cut to all three nodes.

I can't decide which would be the most appropriate recovery strategy in the following scenario: (I have used numbered nodes here just to help illustrate the scenario, but in reality it's arbitrary which nodes are in these states.)

On power cycle, Galera node 2 manages to cleanly leave the cluster, meaning that Galera nodes 1 & 3 are left with a saved view containing 2 members.
When power is restored, it is node 2 which powers on first, meaning that it will have the lowest IP address (this is important in the fallback strategy later).
As the three nodes exchange state information, node 2 doesn't take view state into account because its local gvwstate.dat file was (correctly) deleted when it left the cluster. As a result, node 2 will decide that it should start a new cluster, because the sequence numbers are consistent and it has the lowest IP address.
However, nodes 1 & 3 do check the view state during the state information exchange. They notice that their previous view of 2 members is consistent and, as a result, immediately try to reform the cluster (primary component).

I already have some prototype solutions that ensure only one of these outcomes is executed, as I feel that one node trying to create a new cluster while 2 other nodes try to reform a previous cluster is not ideal. Because of my limited knowledge, I'm struggling to decide which of the two decisions should win.

So in summary:

Should a node (with the lowest IP address) that is not part of the view information start a new cluster?
Should the 2 nodes reform the previous cluster (primary component), with the singled-out node then attempting to join it?

Cheers for the help in advance. I don't mean for this to be too difficult.

After further investigation I have discovered that the easiest to implement is option 1, preferring the node with the highest sequence number and lowest IP, which may not necessarily be part of the 2 node view.

I now think option 2 isn't possible. From the point of view of node 2 (in my example), based solely on the data it received during the state data exchange phase, it cannot guarantee that the other nodes would choose to reform the primary component, because it doesn't know how many members there should be. I came to this conclusion by extrapolating to a 4 node cluster (which you wouldn't normally run), where three nodes are part of the view but one of them doesn't turn back on after the power outage. So even though node 2 can see view state from 2 other nodes, in this situation they wouldn't reform the primary component, because those nodes are expecting a third member.
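
To make the 4 node argument concrete, here is a small Python sketch. It rests on two assumptions that are mine rather than taken from the script: that the view members only reform the primary component when every member of their saved view is reachable, and that a node without a view can only count how many peers report one. The names `can_reform` and `outsider_observation` are hypothetical.

```python
from typing import Dict, FrozenSet, Optional

# Maps node name -> saved view (None when gvwstate.dat is missing).
Views = Dict[str, Optional[FrozenSet[str]]]


def can_reform(saved_view: FrozenSet[str], reachable: FrozenSet[str]) -> bool:
    """Assumed rule: a saved view is only reformed if all its members are back."""
    return saved_view <= reachable


def outsider_observation(views: Views, me: str) -> int:
    """What a node without a view sees: how many peers report a saved view."""
    return sum(1 for name, view in views.items() if name != me and view is not None)


# 3 node cluster: n2 left cleanly, n1 and n3 saved a 2 member view.
three: Views = {
    "n1": frozenset({"n1", "n3"}),
    "n2": None,
    "n3": frozenset({"n1", "n3"}),
}

# 4 node cluster: n2 left cleanly, n1/n3/n4 saved a 3 member view, n4 stays down.
four: Views = {
    "n1": frozenset({"n1", "n3", "n4"}),
    "n2": None,
    "n3": frozenset({"n1", "n3", "n4"}),
}

# n2 observes exactly the same thing in both cases...
same_for_n2 = outsider_observation(three, "n2") == outsider_observation(four, "n2")

# ...but only the 3 node view can actually be reformed by its members.
reform_three = can_reform(three["n1"], frozenset(three))
reform_four = can_reform(four["n1"], frozenset(four))
```

In both clusters node 2 sees two peers with view state, yet only in the 3 node case would those peers go on to reform the primary component, so node 2 cannot base its own decision on that observation alone.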

Cheers

Rich

Hi Rich. In simplified terms, the current script does the following:

Checkpoints (hopefully) all nodes so they start at roughly the same time.
Checks for any nodes already operating as a primary component (PC) and joins them if found
Send/receive state data to/from all other nodes
See if all nodes are consistent and if so restore the cluster (in a full-DC power cycle I think this is most likely)
Otherwise see if only one node has the safe_to_bootstrap flag (in a clean sequential shutdown this is most likely)
Otherwise find the best node by seqno and use that if one machine is higher than all others
Otherwise use the lowest IP of the nodes sharing the highest seqno to bootstrap.
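
As a rough illustration, the decision order above could be sketched in Python as follows. This is not the project's actual script. The `NodeState` fields, the `decide` function and the definition of "consistent" (every node saved the same view, that view contains exactly the responding nodes, and all seqnos match) are assumptions made for the example. The checkpoint step is left out, and the node list is assumed to be non-empty.

```python
from dataclasses import dataclass
from ipaddress import ip_address
from typing import FrozenSet, List, Optional, Tuple


@dataclass
class NodeState:
    """Hypothetical summary of what each node reports during the exchange."""
    ip: str
    seqno: int
    safe_to_bootstrap: bool = False
    view_members: Optional[FrozenSet[str]] = None  # None if gvwstate.dat is missing
    is_primary: bool = False


def decide(nodes: List[NodeState]) -> Tuple[str, Optional[str]]:
    """Return (action, target_ip) following the order of the list above."""
    # Join a node that is already operating as a primary component.
    for node in nodes:
        if node.is_primary:
            return ("join", node.ip)

    # Restore the cluster only if every node agrees (assumed definition).
    all_ips = frozenset(node.ip for node in nodes)
    same_view = all(node.view_members == all_ips for node in nodes)
    same_seqno = len({node.seqno for node in nodes}) == 1
    if same_view and same_seqno:
        return ("restore", None)

    # Exactly one node carries the safe_to_bootstrap flag.
    safe = [node for node in nodes if node.safe_to_bootstrap]
    if len(safe) == 1:
        return ("bootstrap", safe[0].ip)

    # Highest seqno wins; ties are broken by the numerically lowest IP.
    top = max(node.seqno for node in nodes)
    best = [node for node in nodes if node.seqno == top]
    winner = min(best, key=lambda node: ip_address(node.ip))
    return ("bootstrap", winner.ip)


# Rich's scenario: the node without a view has the lowest IP, seqnos are equal.
view = frozenset({"10.0.0.3", "10.0.0.4"})
scenario = [
    NodeState("10.0.0.2", 100),
    NodeState("10.0.0.3", 100, view_members=view),
    NodeState("10.0.0.4", 100, view_members=view),
]
result = decide(scenario)  # ("bootstrap", "10.0.0.2")
```

Under these assumptions the scenario falls through to the last rule, so the node with the lowest IP bootstraps even though it is not part of the saved view. The tie-break compares parsed addresses rather than strings, so `10.0.0.9` sorts below `10.0.0.10`.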

You suggest "highest sequence number and lowest IP", which is what it is already doing when multiple nodes share the same highest seqno. If only one node has the highest seqno then that one will win regardless of IP. IMHO the cluster should not be recovered unless all nodes agree. Otherwise, one node must be chosen to start a new cluster, so it is just a matter of choosing the best one. Trying to recover part of the cluster seems overly complicated, and I think IST (incremental state transfer) will be used, so the other nodes should join very quickly. I'm not seeing where you are suggesting an improvement, but if you want to submit a PR I'd be happy to evaluate it and discuss it further.

colinmollenhour / mariadb-galera-swarm

Galera startup scenarios #42
