CF API Server becomes unavailable during updates

cloudfoundry / cf-for-k8s

The open source deployment manifest for Cloud Foundry on Kubernetes

Apache License 2.0


CF API Server becomes unavailable during updates #636

Open braunsonm opened 3 years ago

braunsonm commented 3 years ago

Describe the bug

In a production deployment, API Server downtime during updates is not in line with CF-for-VMs. The default should deploy more than one replica and perform a rolling update.

Current behavior

The API Server will be taken offline to update the image.

Expected behavior

More than one replica, so that the API Server remains online during CF updates.
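
For illustration, the behavior requested here can be sketched as a generic Kubernetes Deployment. This is not taken from the cf-for-k8s manifests: the resource name, labels, and image are hypothetical placeholders, and only the `replicas` and `strategy` fields are the point.

```yaml
# Generic sketch, not the actual cf-for-k8s resource.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api-server          # hypothetical name
spec:
  replicas: 2                       # more than one replica
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0             # never drop below the desired replica count
      maxSurge: 1                   # start one new pod before stopping an old one
  selector:
    matchLabels:
      app: example-api-server
  template:
    metadata:
      labels:
        app: example-api-server
    spec:
      containers:
      - name: api
        image: example.invalid/api-server:1.0.0   # placeholder image
```

With `maxUnavailable: 0`, an old pod is only terminated once its replacement is ready, so at least two pods keep serving during an image update, provided the pods define a meaningful readiness probe.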

Additional context

cf-for-k8s SHA

v2.1.1

cf-gitbot commented 3 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/177271427

The labels on this github issue will be updated when the story is started.

matt-royal commented 3 years ago

Thank you for the issue, @braunsonm. We just committed a change to the develop branch that allows you to scale up the cf-api-server via a data value (capi.cf_api_server.replicas). Once this makes it into a release, you can easily scale up to 2+ replicas and avoid this problem.
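
As a rough sketch, a ytt data values file that sets this value might look like the following. The key path comes from the data value named above; the file layout and the replica count of 2 are assumptions, so check the values file shipped with the release you deploy.

```yaml
#@data/values
---
capi:
  cf_api_server:
    replicas: 2   # assumed value; anything above 1 keeps a replica available during updates
```

A file like this would be passed to ytt alongside the existing values file when rendering the deployment.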

braunsonm commented 3 years ago

@matt-royal the point of this issue was that I believe this should be the default. This is a 5-node cluster deployment, and it is expected to be highly available without a bunch of tweaks.

If not, I'd recommend a document in the repo that tells users what steps they need to take to make it HA (external DB, external blobstore, recommended replica counts).

Birdrock commented 3 years ago

@braunsonm I'm re-opening this for more discussion.

We've found some configuration that may alleviate the problem, but the larger discussion is about what our default deploy target is. Until now, we've been targeting small clusters or developer workstations. A truly HA configuration isn't a very good out-of-the-box, kick-the-tires solution, so we may need to make some compromises.

To that end, the result of this issue may be to open a new issue with some clarified requirements.

cf-gitbot commented 3 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/177468264

The labels on this github issue will be updated when the story is started.

braunsonm commented 3 years ago

I thought the deployment was targeted at something close to HA, with the exception of the DB and blobstore.

If the goal is to be similar to cf-deployment on BOSH, then the default deployment should be HA with batteries included, with remove_resource_requirements for developer machines. That's the way we personally have been treating it.

When the deployment requirements call for a 5-node cluster, defaulting the target to a developer workstation seems like quite a stretch. As you said, even some clarified documentation for operators running this in production would be good to have 👍 If you need any help with that based on our experience, don't hesitate to reach out.