Drain does not check for Galera cluster health

cloudfoundry / cf-mysql-release

Cloud Foundry MySQL Release

Apache License 2.0

58 stars 106 forks source link

Drain does not check for Galera cluster health #208

Closed CAFxX closed 6 years ago

CAFxX commented 6 years ago

The drain script currently does not check whether the cluster is healthy before stopping a node:

https://github.com/cloudfoundry/cf-mysql-release/blob/f78967ba30660fab51301ce345665df5cf12e40b/jobs/mysql/templates/drain.sh#L1-L6

We think it would be safer to check if all cluster nodes are online and in sync before asking the local node to shutdown.

cf-gitbot commented 6 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/157141182

The labels on this github issue will be updated when the story is started.

ctaymor commented 6 years ago

@CAFxX If the drain script can only succeed when the cluster is healthy, that would prevent anyone from bosh stopping an unhealthy cluster.

We can see how this change would be valuable in some cases, but we're concerned it would be harmful in others. We recommend using the proxy dashboard api to check if the cluster is healthy before doing a deploy.

CAFxX commented 6 years ago

@ctaymor isn't that why bosh deploy and bosh recreate both have a --skip-drain option? If we applied the logic that the operator had to check manually on a job-specific dashboard before running deploy wouldn't cf-deployment become unusable pretty quickly? Also the bosh docs seem to agree that's the whole point of drain scripts (emphasis mine):

This script allows the job to clean up and get into a state where it can be safely stopped.

I think the safer option is to check for obvious signs of the cluster being unhealthy and allow the operator to force the deployment if they think it's the correct thing to do in their case.

ctaymor commented 6 years ago

@CAFxX Yeah, that's a fair point about --skip-drain.

We'd be happy to accept a PR.

CAFxX commented 6 years ago

@ctaymor ok, just a question: is there currently any specific requirement requiring the node to be stopped during drain (instead of during monit stop)?

ndhanushkodi commented 6 years ago

Closing due to open PR to resolve: #209

ndhanushkodi commented 6 years ago

Sorry @CAFxX, thought we commented on this earlier. Yes, we do want to do the stopping in drain since for larger installations it can go over the monit timeout and cause bosh deploy failures.

cf-gitbot commented 6 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/158265416

The labels on this github issue will be updated when the story is started.