carvel-dev / kapp-controller

Continuous delivery and package management for Kubernetes.
https://carvel.dev/kapp-controller
Apache License 2.0

installation of 10 packages concurrently forces GKE control plane autoscaling #448

Open aaronshurley opened 2 years ago

aaronshurley commented 2 years ago

What steps did you take: Reported by other users: Installed kapp-controller (as part of a larger product, TAP) on a new GKE cluster.

What happened: During the installation, the Kubernetes control plane became unavailable for several minutes. This caused package installs to enter a ReconcileFailed state. Eventually, when the API server became available, packages reconciled again to completion.

What did you expect: The installation works without any control plane unavailability.

Anything else you would like to add:

Environment:


Vote on this request

This is an invitation to the community to vote on issues, to help us prioritize our backlog. Use the "smiley face" up to the right of this comment to vote.

👍 "I would like to see this addressed as soon as possible" 👎 "There are other more important things to focus on right now"

We are also happy to receive and review Pull Requests if you want to help work on this issue.

danielhelfand commented 2 years ago

I think a logical next step here would be to try to reproduce this with kapp-controller itself on GKE.

We should also test with TAP to verify any fix works as expected.

Could this be improved by adjusting the kapp-controller's concurrency config (default is 10, what if we reduced it to 5)?

A note here that this is currently configurable via a kapp-controller flag: https://github.com/vmware-tanzu/carvel-kapp-controller/blob/1ad808d8909d49f0dff35ee49d285fb5f0e4693f/cmd/main.go#L25

danielhelfand commented 2 years ago

@cppforlife filed a support ticket with GCP and this was the response:

GKE autoscales the control plane, and this is not configurable. So the question becomes: can we find the threshold after which GKE decides to scale (and potentially avoid hitting it)?

One potential solution to this would be documenting kapp-controller installation on GKE and advising users to set concurrency to a lower value (e.g. 5). This way we could still keep the current default of 10.

cppforlife commented 2 years ago

^ do we know that 5, for example, "fixes" the behaviour?

neil-hickey commented 1 year ago

I believe this is just the accepted behaviour of GKE. We cannot change it, as the main nodes are scaled and managed by Google. Anything further? @cppforlife