cloudfoundry / cf-mysql-release

Cloud Foundry MySQL Release
Apache License 2.0

Add cluster health check to drain script before shutting down local node #209

Closed iamejboy closed 6 years ago

iamejboy commented 6 years ago

Thanks for opening a PR. Please make sure you've read and followed the Contributing guide, including signing the Contributor License Agreement.

Feature or Bug Description

Amended the mysql drain script to check cluster node health before allowing the local node to shut down.

Motivation

We think this adds safety when updating the cluster, since cluster health is checked during drain before the local node is shut down.

Related Issue

https://github.com/cloudfoundry/cf-mysql-release/issues/208

cfdreddbot commented 6 years ago

Hey iamejboy!

Thanks for submitting this pull request!

All pull request submitters and commit authors must have a Contributor License Agreement (CLA) on-file with us. Please sign the appropriate CLA (individual or corporate).

When sending the signed CLA, please provide your GitHub username in the case of an individual CLA, or the list of GitHub usernames that can make pull requests on behalf of your organization.

If you are confident that you're covered under a Corporate CLA, please make sure you've publicized your membership in the appropriate Github Org, per these instructions.

Once you've publicized your membership, one of the owners of this repository can close and reopen this pull request, and dreddbot will take another look.

cf-gitbot commented 6 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/157656403

The labels on this github issue will be updated when the story is started.

ctaymor commented 6 years ago

Hi @CAFxX, Thanks for the PR. Please respond to the issue to let us know when you've signed the CLA, and we'll close and reopen the PR to satisfy the robots.

iamejboy commented 6 years ago

Hi @ctaymor,

We believe we are covered under the Corporate CLA and have already publicized our membership.

Regards

ctaymor commented 6 years ago

Closing and reopening to get cfdreddbot to pick up the CLA.

cfdreddbot commented 6 years ago

Hey iamejboy!

Thanks for submitting this pull request! I'm here to inform the recipients of the pull request that you and the commit authors have already signed the CLA.

cf-gitbot commented 6 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/157781101

The labels on this github issue will be updated when the story is started.

ndhanushkodi commented 6 years ago

Thanks for the feedback @bgandon. @iamejboy, if you want to make any of these changes, feel free. Otherwise we can take care of any cleanup when we merge the PR.

@iamejboy We were going over the functionality where it checks the Synced state for all of the nodes. Typically, two nodes wouldn't be doing a state transfer while the third node is being drained. Any state transfer should be finished before it starts draining the third node.

Are we missing an edge case here where you've seen this happen?

We suggest exit 1 in the non-Synced case, since it removes the echo -5, which risks an infinite loop if a node is stuck in a non-Synced state. In that case the bosh deploy would just hang forever, since there is no drain timeout.
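
For reference, a minimal sketch of the two conventions being discussed, assuming the standard BOSH drain contract (the script prints an integer to stdout; a negative value asks BOSH to run the script again after that many seconds, while a non-zero exit code fails the drain). Paths and queries here are illustrative, not the exact PR code:

  #!/bin/bash
  # Illustrative drain check; the credentials file path is a placeholder.
  MYSQL="mysql --defaults-file=/var/vcap/jobs/mysql/config/drain.cnf -sN"

  state="$($MYSQL -e "SHOW STATUS LIKE 'wsrep_local_state_comment'" | awk '{print $2}')"

  if [ "$state" != "Synced" ]; then
    # Option A (dynamic drain): ask BOSH to re-run this script in 5 seconds.
    # With no drain timeout, a permanently non-Synced node loops forever.
    # echo -5; exit 0

    # Option B (suggested above): fail fast so the operator sees the problem.
    echo "drain: local node is not Synced (state: ${state:-unknown})" >&2
    exit 1
  fi

  echo 0   # healthy: no extra wait before shutdown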

It also seems that it is impossible for a node to be Synced, and non-Primary. If this is the case, we probably don't need to check the Primary component. Does this match your understanding of these two fields?

Thoughts?

iamejboy commented 6 years ago

@bgandon thanks for your feedback. @ndhanushkodi Yes, you may take care of the cleanup.

Are we missing a edge case here where you've seen this happen?

  • We have not seen it yet, as we are still evaluating the cluster carefully, but we were surprised that drain doesn't do a cluster check as an additional safety net.

It also seems that it is impossible for a node to be Synced, and non-Primary

  • I see. If that is the case, there's no need to check the other field.

CAFxX commented 6 years ago

Typically two nodes wouldn't normally be doing a state transfer when trying to drain the third node. Any state transfer should be finished before it starts draining the third node.

Yes, there are a couple of cases AFAICT:

ndhanushkodi commented 6 years ago

Hey @CAFxX and @iamejboy,

We were trying to do some verification and testing around the changes in the drain script. We partitioned off one node so it was Initialized while the other nodes were Synced.

Unfortunately, when the node is partitioned off, the other nodes drop it from their wsrep_incoming_addresses list, so the healthy nodes continue through the drain script without detecting the degraded state of the cluster.

The script as it is does catch some very timing-dependent cases, but not larger cluster problems.
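
For context, a sketch of the kind of per-member check being discussed (not the literal PR code; the credentials file path is a placeholder). Because a partitioned node disappears from wsrep_incoming_addresses on the healthy side, a loop like this never even sees it:

  #!/bin/bash
  MYSQL="mysql --defaults-file=/var/vcap/jobs/mysql/config/drain.cnf -sN"

  # Cluster members as currently seen by the local node, e.g. "10.0.0.10:3306,10.0.0.11:3306".
  incoming="$($MYSQL -e "SHOW STATUS LIKE 'wsrep_incoming_addresses'" | awk '{print $2}')"

  IFS=',' read -ra members <<< "$incoming"
  for member in "${members[@]}"; do
    host="${member%:*}"
    state="$($MYSQL -h "$host" -e "SHOW STATUS LIKE 'wsrep_local_state_comment'" | awk '{print $2}')"
    if [ "$state" != "Synced" ]; then
      echo "drain: cluster member $host is not Synced (state: ${state:-unknown})" >&2
      exit 1
    fi
  done

  echo 0   # every member the local node currently knows about is Synced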

One thought we had was to check all of the IPs of the cluster, using the BOSH link to get them. That causes problems during a scale-down, though, since BOSH deletes the unneeded nodes first and the drain script would then be unable to connect to them.

Any thoughts on how it might be made better?

CAFxX commented 6 years ago

Any thoughts on how it might be made better?

I feel like there should be an easier way, but this should work (using some drain magic):

or, similarly:

@ndhanushkodi do you think this would work?

CAFxX commented 6 years ago

@ndhanushkodi any feedback?

ldangeard-orange commented 6 years ago

Hi, I tested split-brain on a Galera cluster. In this case the statuses are:

+---------------------------+--------------------------------------+
| Variable_name             | Value                                |
+---------------------------+--------------------------------------+
| wsrep_cluster_status      | non-Primary                          |
| wsrep_local_state_comment | Initialized                          |
+---------------------------+--------------------------------------+

ctaymor commented 6 years ago

Hi @CAFxX, having a drain table sounds like a really expensive fix to this problem. We would then have to add logic to avoid dumping it every time we back up the database, or for users migrating their database (for instance to PXC).

We thought of one potential solution which would catch some of these cases and might be less expensive. The drain script could use the BOSH link to get the IPs of the cluster and call the galera-healthcheck API on the other nodes. If a node is unreachable, we would drain, assuming it was deleted. If the node reports a healthy status, we would drain. If the node reports unhealthy, we would fail the drain script.

This would catch some cases of an unhealthy cluster, but the drain script would not fail in a full network partition. What do you think about that trade-off?
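
A rough sketch of that flow, assuming the other nodes' IPs are rendered into the script from the BOSH link and that galera-healthcheck answers plain HTTP on port 9200 (both assumptions; the real port and wiring live in the job spec):

  #!/bin/bash
  # IPs would be rendered from the bosh link in the real template; hard-coded here.
  OTHER_NODE_IPS=("10.0.0.11" "10.0.0.12")

  for ip in "${OTHER_NODE_IPS[@]}"; do
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://${ip}:9200/" || true)"
    case "$code" in
      200) ;;   # node reports healthy: fine to keep draining
      000) ;;   # unreachable: assume it was deleted during a scale-down
      *)   echo "drain: node ${ip} is reachable but unhealthy (HTTP ${code})" >&2
           exit 1 ;;   # fail the drain: the cluster is degraded
    esac
  done

  echo 0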

CAFxX commented 6 years ago

@ctaymor for now that could work. I would suggest also double-checking a node's unreachability via ICMP over a period of time (i.e. only treat it as gone if the host does not respond to pings for, say, 60 seconds). This would obviously slow down the shrinking process, but I would argue that such an operation is rarely executed anyway. The rationale is that at least some IaaSes (e.g. vSphere) have provisions to immediately restart a missing VM, so pinging the VM for a period of time should catch both the case in which the VM disappears (e.g. due to a host crash) right in the middle of a deploy and the case in which the VM is running but MySQL is not working.
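
A minimal sketch of that ICMP double-check, with the 60-second window and 5-second poll interval as illustrative values:

  #!/bin/bash
  # Returns 0 only if the host stays silent for the whole window, i.e. the VM
  # is most likely gone (deleted) rather than just running an unhealthy mysql.
  node_appears_deleted() {
    local ip="$1"
    local deadline=$(( $(date +%s) + 60 ))
    while [ "$(date +%s)" -lt "$deadline" ]; do
      if ping -c 1 -W 2 "$ip" >/dev/null 2>&1; then
        return 1   # host answered: VM is alive, so do not treat it as scaled down
      fi
      sleep 5
    done
    return 0
  }

  # Example wiring with the healthcheck loop sketched earlier (hypothetical):
  #   if [ "$code" = "000" ] && ! node_appears_deleted "$ip"; then
  #     echo "drain: node ${ip} is alive but its healthcheck is unreachable" >&2
  #     exit 1
  #   fi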

Better long-term solutions would be either:

ctaymor commented 6 years ago

@CAFxX I think you hit the nail on the head, that we're trying to come up with a mediocre workaround when the solution to enabling good drain scripts for clustered services should really be pushed down to the bosh level. I'd recommend opening an issue in BOSH.

Thanks for spending all this time working on trying to make the drain scripts better.

CAFxX commented 6 years ago

@ctaymor ok, we'll improve the current good-enough-for-now drain implementation as discussed here, and in the background we'll open an issue with BOSH. Based on the Slack discussion, we should allow the operator to opt out of the ICMP protection (e.g. for the case in which ICMP is not allowed in the deployment networks).
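
A sketch of what that opt-out could look like inside the drain script; the flag name is invented for illustration and would be rendered from a BOSH property via the job's ERB template in the real release:

  #!/bin/bash
  # Hypothetical flag, defaulting to "on"; the real property name would live in the job spec.
  ENABLE_DRAIN_ICMP_CHECK="${ENABLE_DRAIN_ICMP_CHECK:-true}"

  # Decide whether an unreachable node should be treated as scaled down.
  treat_unreachable_as_deleted() {
    local ip="$1"
    if [ "$ENABLE_DRAIN_ICMP_CHECK" != "true" ]; then
      return 0   # operator opted out of ICMP: assume the node was scaled down
    fi
    ! ping -c 1 -W 2 "$ip" >/dev/null 2>&1   # deleted only if it no longer answers pings
  }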

CAFxX commented 6 years ago

I also reached out to Dmitriy, and it seems there's already some activity happening at SAP around extending BOSH for similar use cases:

https://github.com/cloudfoundry/bosh-notes/blob/master/proposals/drain-actions.md

iamejboy commented 6 years ago

Hi @ctaymor & @ndhanushkodi

Updated the PR with the cleanup plus what was discussed above. Feel free to re-test and amend it for any errors you find.

Regards,

CAFxX commented 6 years ago

@ctaymor @ndhanushkodi @bgandon have you had a chance to look at this?

ndhanushkodi commented 6 years ago

Hey @iamejboy @CAFxX,

We've added a story to our backlog to pull in some of these changes to fix the drain script issues, and we may change the implementation a little. Unfortunately, our team is stretched a little thin and this will have to be prioritized low for now, but we should be able to eventually get to it.

Thanks, Nitya and Joseph (@jpalermo)

CAFxX commented 6 years ago

@ndhanushkodi our production rollout of c2c is blocked on this (and the other c2c bug @iamejboy reported a few days ago, but it seems that a candidate fix is available now), so we were hoping it would happen sooner than "eventually". Just to organize work on our side, any rough estimate about when you realistically expect to be able to work on this?

ctaymor commented 6 years ago

Hi @CAFxX, we do want to go ahead and try to unblock you. We're scheduling this to be worked on soon. https://www.pivotaltracker.com/story/show/157656403

jpalermo commented 6 years ago

We updated the drain script to use the Galera Healthcheck endpoint rather than determining node health directly (since this is a job Galera Healthcheck already performs).

Should be merged into the develop branch now. Let us know if you have any questions or concerns.

jpalermo commented 6 years ago

Hey @CAFxX. Looks like our pipelines ran into a problem.

When deleting a deployment, BOSH does not respect the max-in-flight options. So rather than draining 1 node at a time, it drains all at the same time. With the new drain script behavior in place, this caused most of the delete-deployments we ran to fail.

You can get around this with a --force or a --skip-drain, but we don't think this is a reasonable requirement for deleting a healthy deployment.
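
For completeness, the workaround mentioned above looks like this with the BOSH CLI (the deployment name is an example):

  # skip drain scripts entirely when deleting the deployment
  bosh -d cf-mysql delete-deployment --skip-drain

  # or continue past drain (and other) errors
  bosh -d cf-mysql delete-deployment --force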

We talked with the BOSH team to see if there is any way for us to differentiate scaling down a node from deleting a deployment from the perspective of the drain script, but there is no way to do so. This is not the first time they've heard of this issue, and they are going to prioritize some work to resolve the parallel draining, but there is no guarantee it will get done soon.

We don't think there is a simple enough solution in cf-mysql-release to implement this behavior currently.

However, we did port the behavior over to our new pxc-release. This is intended to be a replacement for cf-mysql-release moving forward.

Because we've structured the jobs differently in pxc-release, and separated the galera-agent/galera_health_check job out with its own drain scripts, we think the drain script will work there without any problem.

CAFxX commented 6 years ago

@jpalermo in my understanding the drain script is designed to aid graceful/zero-downtime lifecycles for bosh jobs, and, at least from my perspective, it's hard to describe deleting a whole deployment as a "graceful" operation. If we agree on this, then using --force/--skip-drain in the pipelines should be OK (but I agree that the BOSH behavior is at fault here and should be changed, so much so that I already filed https://github.com/cloudfoundry/bosh-notes/pull/47 a while back).

However, we did port the behavior over to our new pxc-release. This is intended to be a replacement for cf-mysql-release moving forward.

We would like to avoid using pre-release/non-GA components in production or, as in the case of the current cf-mysql-release, components with known issues that can affect reliability.

jpalermo commented 6 years ago

@CAFxX if it were just our pipelines, we would probably consider requiring the --force/--skip-drain, but there are many automated deployments and tools out in production customer instances that depend on a vanilla bosh delete-deployment working.

Open source software is a bit strange, since it can be hard to determine what is pre-release and what is ready for production. I can tell you that our biggest deployments are now using pxc-release in production and Pivotal's latest version of Pivotal Cloud Foundry just shipped with pxc-release as the default database.

We have not hit the 1.0 release banner for pxc yet: https://www.pivotaltracker.com/story/show/157528771

At that point we plan to transition cf-deployment away from cf-mysql-release. But the only thing holding up the 1.0 release at this point is that we want to change the bosh property syntax for creating users/databases; it has nothing to do with stability or performance.

CAFxX commented 6 years ago

@jpalermo would you be open to leaving it behind an opt-in manifest property? We are also considering pxc-release, but we were really close to rolling this out, and moving to pxc would require redoing a significant amount of validation work, which would push back our c2c rollout by weeks.

jpalermo commented 6 years ago

I'll see what the mysql product managers think of that idea today.

iamejboy commented 6 years ago

Hi @jpalermo @ctaymor @ndhanushkodi,

Thanks for adding this as an option.

By the way, can we remove the 000 here: https://github.com/cloudfoundry/cf-mysql-release/blob/4c657b38ff7998c46e89d9163baee80b891a9d5a/jobs/mysql/templates/drain.sh#L32

Since the spec description already says "--skip-drain flag is required in order to delete a deployment", we could add "or scaling down" to it.

That way, drain will fail if the cluster has an unreachable node, a crashed node, or simply a monit-stopped node. The drain script will be more useful for making sure the cluster is healthy before shutting down the local node. When 000 is accepted, we assume the node was scaled down, but that is not always the case; it could be any of the situations mentioned above.
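
For context, a sketch of the kind of check being referred to (not the literal line in drain.sh): with curl, %{http_code} comes back as 000 when the node cannot be reached at all, so accepting 000 is what makes the script assume a scale-down:

  ip="10.0.0.11"   # example peer address; 9200 assumed as the healthcheck port
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://${ip}:9200/" || true)"

  # Accepting 000 (the behavior being questioned): unreachable nodes are
  # assumed to have been scaled down, so the drain proceeds.
  #   [ "$code" = "200" ] || [ "$code" = "000" ] || exit 1

  # Dropping 000, as suggested above: any unreachable, crashed, or
  # monit-stopped node fails the drain instead.
  #   [ "$code" = "200" ] || exit 1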

cc: @CAFxX