colinmollenhour / mariadb-galera-swarm

MariaDb Galera Cluster container based on official mariadb image which can auto-bootstrap and recover cluster state.
https://hub.docker.com/r/colinmollenhour/mariadb-galera-swarm
Apache License 2.0

gossip about found nodes #64

Closed lexelby closed 5 years ago

lexelby commented 5 years ago

This is a fairly substantial change to the cluster assembly communication. Nodes will now send the full list of nodes they know about to all other nodes, rather than just their own information.

Why do this? I'm building an environment in which nodes will come and go routinely. There's a possibility that not all nodes will be up when a given node starts, and we'll be using DNS-based service discovery for nodes to find each other.

In the case where the entire cluster is rebooted more or less simultaneously, some nodes will resolve DNS before other nodes have been added to DNS. GCOMM_MINIMUM is designed to handle this scenario, ensuring that all nodes keep re-resolving DNS until they know about all other nodes before moving on to exchange position information. However, I can't set GCOMM_MINIMUM to the number of nodes in my cluster because at any given time, one or more nodes may not be in the cluster.
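(For context, the GCOMM_MINIMUM wait boils down to a loop roughly like the one below. This is only a sketch of the idea, not the container's actual code; SERVICE_NAME is a made-up variable, and tasks.<service> is the Swarm DNS name that resolves to all task IPs.)

# Sketch only -- the idea behind GCOMM_MINIMUM, not the container's actual code.
# Keep re-resolving the service's DNS name until at least GCOMM_MINIMUM peers
# are visible before moving on to the position exchange.
while true; do
  PEERS=$(getent ahosts "tasks.$SERVICE_NAME" | awk '{print $1}' | sort -u)
  [ "$(wc -w <<<"$PEERS")" -ge "${GCOMM_MINIMUM:-1}" ] && break
  sleep 3   # wait for more replicas to register in DNS
done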

Setting a lower GCOMM_MINIMUM will allow some nodes to proceed to run mysqld.sh without knowing about the full list of other nodes. I saw asymmetric discovery scenarios, where some nodes saw the full cluster but others didn't (and never would).

This patch seems to solve my problem. It models the log position exchange on the way Galera's own node discovery works: a given Galera node only needs to connect to one node in the cluster, and it then learns about the rest of the cluster and connects to the other nodes.

Now, I can set GCOMM_MINIMUM to 1. Even if all nodes only see the first node that registered in DNS, they still discover the positions of every node in the cluster in mysqld.sh and the cluster assembles properly.
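To make that concrete, the exchange behaves roughly like the sketch below. This is only an illustration of the idea, not the code in the PR; send_gossip, cluster_ready, and RECOVERED_SEQNO are made-up names.

# Sketch only -- illustrates the gossip idea, not the PR's actual code.
# Each node repeatedly sends everything it knows (address:seqno pairs) to every
# node it knows about, so knowledge spreads even if DNS only returned one peer.
KNOWN_NODES="$NODE_ADDRESS:$RECOVERED_SEQNO"   # RECOVERED_SEQNO is hypothetical

while ! cluster_ready; do                      # cluster_ready is hypothetical
  for node in $(tr ',' '\n' <<<"$KNOWN_NODES" | cut -d: -f1); do
    # Send the full known list, not just our own entry, and merge whatever
    # the peer reports back into our list (deduplicated).
    REPLY=$(send_gossip "$node" "$KNOWN_NODES") || continue
    KNOWN_NODES=$(printf '%s,%s' "$KNOWN_NODES" "$REPLY" | tr ',' '\n' | sort -u | paste -sd, -)
  done
  sleep 5   # one gossip cycle
done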

I'd love to hear what you think about this. Is it a terrible idea for some reason I'm not thinking of?

colinmollenhour commented 5 years ago

Thanks for the PR, I haven't had time to look into it deeply yet but will try to soon.

lexelby commented 5 years ago

No rush at all, but I wanted to let you know that I have further work that I plan to send your way after/if this PR is merged.

I discovered a nasty little split-brain scenario. It goes something like this:

  1. The cluster partitions (e.g. an inter-datacenter link fails).
  2. Partition A holds the Primary Component and continues operating.
  3. Partition B goes non-primary.
  4. Docker kills all members of B soon thereafter since their health checks are failing.
  5. All members of B start at approximately the same time.
  6. Members of B discover each other, choose the lowest-IP node, and bootstrap.

Now A and B are two separate clusters, each of which is a Primary Component. Bad news, especially if the inter-datacenter link comes back.

In my setup, the best solution here is to disallow cluster reassembly unless a human is present and enables it (with a flag file). Not sure if that makes sense for all clusters. Setting GCOMM_MINIMUM > the size of A or B would prevent this, but in my case that's not a good solution because nodes will come and go.
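Roughly, I'm picturing a guard like this at the top of the bootstrap path (the path and wording are just illustrative, not part of this PR):

# Sketch of a human-in-the-loop guard before bootstrapping a new cluster --
# the flag-file path is illustrative.
FLAG=/var/lib/mysql/allow-reassembly

if [ ! -f "$FLAG" ]; then
  echo "Refusing to bootstrap: cluster reassembly requires operator approval (create $FLAG)." >&2
  exit 1
fi
rm -f "$FLAG"   # consume the flag so the next reassembly needs fresh approval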

Thanks again for writing this container image!

colinmollenhour commented 5 years ago

Your explanation makes sense and the code changes look good. I don't really have time to test this thoroughly, but it sounds like you've done quite a bit of testing with it. Are you using this PR in production?

About the additional 11-second sleep, can you explain more about why you think this is necessary and how you came up with 11 seconds? There is obviously a trade-off between ensuring a perfect reboot and being able to reboot quickly. Waiting multiple minutes for the cluster to come back up can be painful, so I'm interested to know what experiences you've had. I suppose you are doing less waiting on GCOMM_MINIMUM, so this additional 11 seconds might actually be a net win compared to waiting forever up front?

Thanks again for sharing your improvements!

lexelby commented 5 years ago

but it sounds like you've done quite a bit of testing with it. Are you using this PR in production?

I have tested fairly extensively, but only in a pre-production environment so far. I see nothing that concerns me currently, but of course production has a way of turning up issues we don't foresee. :) I'll let you know if anything comes up.

About the additional 11-second sleep, can you explain more about why you think this is necessary and how you came up with 11 seconds?

This is a bit nuanced, and I almost fooled myself into forgetting the reasoning behind it as I tried to type out an explanation.

The sleep is here because of this logic in mysqld.sh:

# Proceed only when both conditions hold:
#  1. we have received position (seqno) information from at least $EXPECT_NODES
#     nodes other than this one (our own address is filtered out), and
#  2. this node has successfully sent its own state to at least $EXPECT_NODES
#     distinct nodes ($SENT_NODES is a comma-separated list of them).
if   [[ $(<$tmpfile grep -vF :$NODE_ADDRESS: | awk -F: '/^seqno:/{print $2}' | sort -u | wc -w) -ge $EXPECT_NODES ]] \
  && [[ $(<<<$SENT_NODES tr ',' '\n' | sort -u | wc -w) -ge $EXPECT_NODES ]]
then

This means that no node will consider itself ready to boot unless it has communicated outward to $EXPECT_NODES other nodes. But one node may hear about (and finish sending to) all $EXPECT_NODES nodes before every one of those nodes has successfully communicated to it. If that node moves on the moment it is ready, it will no longer be there for the slower nodes to complete their exchange with, and those nodes will be stranded.

My logic in choosing a sleep of 11 seconds was something like "two gossip cycles, plus a fudge". The fudge is because each cycle is probably more than 5 seconds once you factor in network latency, disk I/O, etc.

A single cycle (6 seconds) is probably enough here, but I wanted to be really sure that no nodes get left behind. It's just possible that some node finished sending just before it heard from this node, does its processing, sleeps for 5 seconds, and barely misses sending to this node. We could try 6 if you'd like, but I find that 11 seconds is still quite quick, and a full cluster reboot is a relatively rare event.
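In other words, the tail end of the exchange behaves roughly like this sketch (not the literal diff; ready_to_boot stands in for the two-part test quoted above):

# Sketch only: once the two conditions quoted above are satisfied, keep the
# listener alive and wait roughly two gossip cycles plus a fudge factor before
# proceeding, so slower peers can still complete their exchange with this node.
GOSSIP_CYCLE=5
LINGER=$(( 2 * GOSSIP_CYCLE + 1 ))   # = 11 seconds

if ready_to_boot; then   # ready_to_boot is hypothetical shorthand
  sleep "$LINGER"
fi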

colinmollenhour commented 5 years ago

Thanks for the additional explanation. I don't have any objections, further questions, or concerns, other than not having had a chance to test it myself. Please let me know if you find any issues or further improvements to add, but I'm pretty certain this will get merged eventually. :)

colinmollenhour commented 5 years ago

I've still not had a chance to play with this at all. Are you still using it? Do you think it should be merged in its current state?

lneva-fastly commented 5 years ago

It's worked great for us for the past several months, so I vote merge as is.

I switched to this new account for work, sorry for any confusion :)

colinmollenhour commented 5 years ago

Thanks, Lex!