cloudfoundry-incubator / kubo-deployment

Contains manifests used to deploy Cloud Foundry Container Runtime
https://www.cloudfoundry.org/container-runtime/
Apache License 2.0

Adding/removing a master node to the cluster is kicking off a whole cluster restart (canary deployment) #344

Open lgunta2018 opened 5 years ago

lgunta2018 commented 5 years ago

What happened: Adding/removing a master node to the cluster is kicking off the whole cluster restart (canary deployment)

What you expected to happen: Adding/removing a master node should not kick off the whole cluster restart

How to reproduce it (as minimally and precisely as possible):

Steps:

  1. Create a cluster with 3 master and 3 worker nodes
  2. Update the number of master node instances to 2 in the cfcr.yml file (a minimal ops-file sketch of this change is shown after this list)
  3. Run the bosh deploy command; you can see the whole cluster restart
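For reference, a minimal sketch of the change in step 2, expressed as a BOSH ops-file. The file name is hypothetical; the path only assumes the control-plane instance group is named master, as it is in the stock cfcr.yml and in the run logs below.

```yaml
# scale-masters.yml -- hypothetical ops-file; assumes cfcr.yml names its
# control-plane instance group "master".
- type: replace
  path: /instance_groups/name=master/instances
  value: 2
```

It would be applied by adding -o scale-masters.yml to the bosh deploy command shown in the run log.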

runlog:

Removing master node:

bosh deploy -d cfcr ${KD}/manifests/cfcr.yml -o ${KD}/manifests/ops-files/iaas/aws/cloud-provider.yml -o cfcr-ops.yml -l <(bbl outputs)

Using environment 'https://10.0.0.6:25555' as client 'admin'

Using deployment 'cfcr'

Release 'cfcr-etcd/1.5.0' already exists.

Release 'bpm/0.12.3' already exists.

addons:

Release 'docker/32.0.0' already exists.

instance_groups:

Continue? [yN]: y

Task 120

Task 120 | 23:47:07 | Preparing deployment: Preparing deployment (00:00:07)
Task 120 | 23:47:39 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 120 | 23:47:39 | Deleting unneeded instances master: master/a02c36fb-a6d0-4ca9-8178-8a929087d32e (2) (00:00:43)
Task 120 | 23:48:22 | Updating instance master: master/ba1fbbda-5080-4094-b8c8-671d9abb34a6 (0) (canary) (00:01:05)
Task 120 | 23:49:27 | Updating instance master: master/c5dd8b42-ee9e-4607-ad44-152144a7eebf (1) (00:01:22)
Task 120 | 23:50:49 | Updating instance worker: worker/43ac3278-5a5b-4a9b-a782-e3b52254f98d (0) (canary) (00:00:33)
Task 120 | 23:51:22 | Updating instance worker: worker/41823365-c567-4acc-a2c3-d3df897ee8b3 (1) (00:00:35)
Task 120 | 23:51:57 | Updating instance worker: worker/598eec97-2305-4eac-a784-2fee04d6121b (2) (00:00:41)
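(For context, the canary-then-rest ordering in this log comes from the manifest's update block; the snippet below is only an illustrative sketch of that block's shape, not necessarily the exact values in cfcr.yml.)

```yaml
# Illustrative update block; the actual values in cfcr.yml may differ.
update:
  canaries: 1
  canary_watch_time: 10000-300000
  max_in_flight: 1
  update_watch_time: 10000-300000
```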

Same behavior is observed for adding a new master node:

→ bosh deploy -d cfcr ${KD}/manifests/cfcr.yml -o ${KD}/manifests/ops-files/iaas/aws/cloud-provider.yml -o cfcr-ops.yml -l <(bbl outputs)

Using environment 'https://10.0.0.6:25555' as client 'admin'

Using deployment 'cfcr'

Release 'bpm/0.12.3' already exists.

Release 'docker/32.0.0' already exists.

Release 'cfcr-etcd/1.5.0' already exists.

Release 'bosh-dns/1.8.0' already exists.

instance_groups:

Continue? [yN]: y

Task 234

Task 234 | 00:10:30 | Preparing deployment: Preparing deployment (00:00:06)
Task 234 | 00:11:07 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 234 | 00:11:07 | Creating missing vms: master/09e9e7e6-7e1f-4d51-b591-1c0f1e8ca12d (2) (00:01:27)
Task 234 | 00:12:34 | Updating instance master: master/ba1fbbda-5080-4094-b8c8-671d9abb34a6 (0) (canary) (00:01:07)
Task 234 | 00:13:41 | Updating instance master: master/c5dd8b42-ee9e-4607-ad44-152144a7eebf (1) (00:01:05)
Task 234 | 00:14:46 | Updating instance master: master/09e9e7e6-7e1f-4d51-b591-1c0f1e8ca12d (2) (00:01:27)
Task 234 | 00:16:13 | Updating instance worker: worker/43ac3278-5a5b-4a9b-a782-e3b52254f98d (0) (canary) (00:00:33)
Task 234 | 00:16:46 | Updating instance worker: worker/41823365-c567-4acc-a2c3-d3df897ee8b3 (1) (00:00:42)
Task 234 | 00:17:28 | Updating instance worker: worker/598eec97-2305-4eac-a784-2fee04d6121b (2) (00:00:35)

Task 234 Started  Sat Sep 15 00:10:30 UTC 2018
Task 234 Finished Sat Sep 15 00:18:03 UTC 2018
Task 234 Duration 00:07:33

Task 234 done

Anything else we need to know?: Adding worker nodes works fine; it does not restart the whole cluster. kubo-deployment: v0.21.0

Environment:

Name  Release(s)        Stemcell(s)                                    Config(s)        Team(s)
cfcr  bosh-dns/1.8.0    bosh-aws-xen-hvm-ubuntu-xenial-go_agent/97.16  1 cloud/default  -
      bpm/0.12.3                                                       2 runtime/dns
      cfcr-etcd/1.5.0
      docker/32.0.0
      kubo/0.21.0

Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-10T11:44:36Z", GoVersion:"go1.11", Compiler:"gc", Platform:"darwin/amd64"}

Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"bb9ffb1654d4a729bb4cec18ff088eacc153c239", GitTreeState:"clean", BuildDate:"2018-08-07T23:08:19Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

cf-gitbot commented 5 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/160542355

The labels on this github issue will be updated when the story is started.

alex-slynko commented 5 years ago

Hi @lgunta2018

This is intended behaviour right now with the current way CFCR is deployed. Each master is collocated with an etcd node. Each worker uses flannel networking that is connected to etcd, and flannel is reconfigured each time an etcd node is added or removed. The BOSH team is considering improving the flow so that only a single job gets restarted, but this is not in their short- or mid-term plans.

Out of curiosity, what was the reason to scale masters down to 2 VMs?

lgunta2018 commented 5 years ago

Hi @alex-slynko, thanks for the quick update. The reason was just to see how the cluster reacts when we remove a master node. So, if we lose a master node due to some issue, does that mean the whole cluster is restarted to bring a new master node in place of the existing one? And why do we need to restart the whole cluster when we add a master node to the cluster?

lgunta2018 commented 5 years ago

Hey @alex-slynko, thank you for the response. However, we would like our apps deployed on the workers not to be restarted while handling a master node scale down/up. Our expectation was borne out of the fact that, though this is a control plane update, it shouldn't need to restart other parts of the cluster like the workers. Would it be possible to externalize etcd from the master nodes (create an etcd cluster with an LB), and would that solve this issue of a rolling restart across the cluster? Not all of our workloads are cloud-native apps and many have stateful sessions, which we would like to keep from restarting unless necessary, to reduce downtime on those apps.

youreddy commented 5 years ago

@lgunta2018,

if we lose the master node due to some issue, it means it restart the whole cluster to bring a new master node in place of the existing master node?

It depends. If the BOSH resurrector has noticed that one of your master VMs has gone missing, it will recreate it without touching the workers. Also, say your manifest states master instances: 3 but one of the masters is failing; in this case BOSH hasn't seen a change to the instance count, so it won't re-template the jobs or touch the workers.
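A sketch of the relevant CLI commands (assuming the bosh v2 CLI; resurrection is a director-wide setting, not specific to this deployment):

```bash
# Make sure the resurrector will recreate lost VMs automatically.
bosh update-resurrection on

# If resurrection is off, a missing master VM can be recreated manually
# without touching the workers:
bosh -d cfcr cloud-check
```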

Why do we need to restart the whole cluster when we adding a master node to the cluster?

Basically what Oleksandr said above. flanneld is running on every VM in the cluster. The flannel job consumes the etcd BOSH links and iterates through the list of etcd instances. So every time the number of etcd instances changes (i.e. the number of masters, since they're colocated), the flannel job gets re-templated on every VM and BOSH updates the instances.
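An illustrative sketch of that wiring (this is not the actual cfcr.yml; release and link names are simplified to show the shape of the dependency):

```yaml
# Simplified sketch, not the real manifest: every instance group that runs
# flanneld consumes the etcd link provided by the master group.
instance_groups:
- name: master
  instances: 3
  jobs:
  - name: etcd
    release: cfcr-etcd
    provides:
      etcd: {as: cfcr-etcd}
  - name: flanneld
    release: kubo
    consumes:
      etcd: {from: cfcr-etcd}
- name: worker
  instances: 3
  jobs:
  - name: flanneld
    release: kubo
    consumes:
      etcd: {from: cfcr-etcd}
```

Because every flanneld consumer enumerates the instances of the etcd link when its templates are rendered, any change to the master/etcd instance count changes the rendered config on every VM, and BOSH rolls the whole deployment.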

Would it be possible to externalize the etcd from the master nodes (create etcd cluster with LB) and would that solve this issue of a rolling restart across the cluster?

Externalizing etcd would be a larger architectural change and may not necessarily solve the problem. We might be able to solve this in a simpler way by not iterating through the list of etcd links and instead configuring flanneld to use the etcd BOSH DNS entry. We discussed this approach this morning, but we need to spike it out because there's probably more involved than just those etcd links. There's a spike in our public tracker if you want to follow progress.
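Roughly, the rendered flanneld configuration would then reference a single DNS name that resolves to all etcd nodes instead of one endpoint per instance; something like the hypothetical line below (the exact DNS name and port are illustrative):

```bash
# Hypothetical single-endpoint flanneld setting; the DNS name shape
# (q-s0.<group>.<network>.<deployment>.bosh) and port are illustrative.
FLANNELD_ETCD_ENDPOINTS="https://q-s0.master.default.cfcr.bosh:2379"
```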

alex-slynko commented 5 years ago

Hi @lgunta2018

  1. If you want to test how the cluster reacts to removing a master node, you need to use the delete-vm or stop command (a command sketch follows this list). I wrote a tiny blog post where I tried to explain the difference.
  2. We created a spike to investigate this further, but I can't guarantee we will work on it or fix it soon. It might require a big redesign of CFCR or some architectural BOSH changes. We might prioritize it if we see business value in improving it.
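A sketch of what that looks like (the VM CID below is illustrative and can be looked up with bosh -d cfcr vms; the instance ID is taken from the run log above):

```bash
# Simulate losing a master VM without changing the manifest's instance count
# (the CID is illustrative; find the real one with `bosh -d cfcr vms`):
bosh -d cfcr delete-vm i-0abc123def456

# Or take a single master instance down instead (instance ID from the run log);
# --hard also deletes the VM while keeping its persistent disk:
bosh -d cfcr stop master/ba1fbbda-5080-4094-b8c8-671d9abb34a6 --hard
```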

There are two workarounds for this that I can think of, but we haven't tried them.

Feel free to ask more questions in the Slack channel or here.

lgunta2018 commented 5 years ago

Thanks, @youreddy and @alex-slynko, for looking into this issue. I will try to use your workarounds to solve this problem for now, but it's a very useful feature in my case, and I hope we will see some traction in this regard. I will let you know if the workarounds do not work for me.