GoogleCloudPlatform / k8s-config-connector

GCP Config Connector, a Kubernetes add-on for managing GCP resources
https://cloud.google.com/config-connector/docs/overview
Apache License 2.0

ContainerNodePool resource doesn't have accurate NodeCount set when Autoscaling #163

Open nitishkrishna opened 4 years ago

nitishkrishna commented 4 years ago

My cluster has a node pool created with Config Connector, with autoscaling limits set to min=1, max=3.

I scaled this up using a test deployment and noticed that nodeCount on the resource was set to 2 when there were actually 3 nodes in each zone.

(Screenshot attached: 2020-05-04, 11:42 AM)

When I turned off autoscaling on this resource, it actually deleted nodes.

Is there a known lag before the Config Connector controller updates nodeCount on the resource? When creating a new node pool it waits until all nodes are Ready before setting this field, but that didn't happen in this (autoscaling) case.

ConfigConnector Version 1.7.0

To Reproduce

YAML snippets:

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  annotations:
    cnrm.cloud.google.com/management-conflict-prevention-policy: none
    cnrm.cloud.google.com/project-id: xxx
  generateName: sample-nodepool-
  labels:
    nodepool: sample-nodepool
  name: sample-nodepool-szqv4
  namespace: default
spec:
  clusterRef:
    external: xyz
  initialNodeCount: 1
  location: us-west1
  management: {}
  maxPodsPerNode: 110
  nodeConfig:
    diskSizeGb: 100
    diskType: pd-standard
    imageType: COS
    machineType: n1-standard-1
    metadata:
      additionalProperties: ""
      disable-legacy-endpoints: "true"
    minCpuPlatform: Intel Skylake
    oauthScopes:
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    serviceAccountRef:
      external: default
    shieldedInstanceConfig:
      enableIntegrityMonitoring: true
  nodeCount: 2
  upgradeSettings:
    maxSurge: 1
    maxUnavailable: 0
  version: 1.15.11-gke.9
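
For reference, the autoscaling limits mentioned above (min=1, max=3) are not shown in this snippet; on the same resource they look roughly like this (a sketch; field names per the v1beta1 ContainerNodePool schema):

spec:
  autoscaling:
    minNodeCount: 1
    maxNodeCount: 3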

nitishkrishna commented 4 years ago

I also noticed that the ContainerNodePool controller tries to delete nodes even when the autoscaling max is increased (say, to 5), because nodeCount was wrong (2 instead of 3). Why does increasing the autoscaling limit also cause nodes to be deleted?

jonnylangefeld commented 4 years ago

Can confirm, I tried the following:

I would expect the additional nodes to simply be added. Instead it tries to scale down to the nodeCount value of the ContainerNodePool resource, which it picked up automatically (1 in my case, seen through kubectl describe). Since that number is lower, it gets stuck: at this point my workload is far too high to fit on one node, but the controller keeps trying to scale down. I can see it scaling down in the GCP console, with two nodes cordoning. I could only unfreeze it by scaling my sample workload down until it fit on that one node; the two cordoning nodes went away, I resized the sample workload back up to a high number, and it scaled up to 4.

To test the dependency on the nodeCount value, I ran the exact same test, but this time I removed the nodeCount line while changing autoscaling to 1-4 via kubectl edit. It didn't scale down first; it only added the 4th node, as expected. The nodeCount value on the ContainerNodePool resource was 3 at that point, while the number of nodes in the GCP console was 4.

Expected behavior would be that if nodeCount is not in my YAML, it shouldn't be taken into consideration by Config Connector.
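
To make the second test concrete, after the kubectl edit the relevant part of the spec looked roughly like this (a sketch; the rest of the manifest unchanged):

spec:
  autoscaling:
    minNodeCount: 1
    maxNodeCount: 4
  # nodeCount removed, so there should be nothing for Config Connector to enforce here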

kibbles-n-bytes commented 4 years ago

We currently use what is defaulted back into the API server as the declared source of truth. Unfortunately, this is one of the scenarios that causes fighting with autoscaling/auto-updating services. We're patching these cases as we find them.

The scenarios we currently support are:

ContainerNodePool is next on our immediate list, and we'll update this bug with more information once a fix is in place.

We are working on a generic solution at the moment that will handle all cases and allow defaulted values to float with the underlying API.

jonnylangefeld commented 4 years ago

Hi @kibbles-n-bytes, that provokes the general question: is the ContainerNodePool spec designed to be the desired state, or a reflection of the world's actual state? I'd expect it to be the desired state, since you change it to provoke an update you desire (e.g. updating the autoscaling max). However, that conflicts with the fact that some parameters are updated from the other side, the world's actual state (the API server). In that case the desired state might get overwritten with the world's actual state and nothing changes. If it's a reflection of the world's actual state, then why are only a few parameters updated? And where should the desired spec go in that case? (Terms 'desired' and 'world's actual state' borrowed from the Kubernetes Writing Controllers page.)

kibbles-n-bytes commented 4 years ago

Hey @jonnylangefeld ,

Short answer is: it's intended to be the desired state.

Config Connector is built to feel as if there were no layer between it and the underlying API, and as if the underlying API behaved "Kubernetes-native". This is why the spec of the resource represents the full set of configurable fields of the underlying API; just as core Kubernetes resources like Pods do defaulting and mutations in the spec beyond what was directly kubectl apply'd in your config, we do the same with the fields that were defaulted by the underlying API upon resource creation.

However, what we do today is bring those field values back into the spec and then treat them with the same priority for drift detection and prevention as the fields you explicitly set. This is usually fine for resources that you want to be completely stable and to always look the way they did at resource creation unless explicitly modified in your API server. However, we've been finding cases where it's actually desired for the underlying resource to be unstable; some examples are release channels and auto node upgrades in GKE, and disk autoresize in SQL. These can be managed completely declaratively with the Kubernetes API server as the full source of truth, but we recognize there's a large convenience in having GCP be able to update some things on your behalf.

Long-term, we're looking to leverage server-side apply's concept of managers to be more nuanced about what the ultimate desired state is. This would mean that defaulted fields wouldn't be considered managed by you, but instead by Config Connector, and we could treat them with a different priority for drift detection and prevention. So, for instance, if you set autoscaling but didn't set a specific node count, we'd ensure that autoscaling on the underlying API is enforced, but the explicit node count could float from moment to moment. Note that the spec would still represent the resource's desired state, but for the node count field specifically it would be Config Connector's desired state, and that desired state would be whatever GKE auto-updated it to.
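
As an illustration (a sketch only; the manager names are hypothetical and the fieldsV1 layout is simplified), the field ownership on a ContainerNodePool could then end up looking something like:

metadata:
  managedFields:
  - manager: kubectl                    # hypothetical: fields the user applied explicitly
    operation: Apply
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:autoscaling:
          f:minNodeCount: {}
          f:maxNodeCount: {}
  - manager: cnrm-controller-manager    # hypothetical: defaulted fields owned by Config Connector
    operation: Apply
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:nodeCount: {}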

We're still fleshing out the design, so in the short-term, we're trying to address the specific cases customers bring up to make sure you're not blocked on the full story being implemented.

jonnylangefeld commented 4 years ago

Hey @kibbles-n-bytes, thanks for your reply! The problem with changing it from both ends (the actual world via the API server vs. the controller/human setting the desired state) is that the desired state can be overwritten at any time. For instance, if I edit the ContainerNodePool spec and delete autoscaling but set nodeCount, my defined node count will get overwritten right away, because the actual state of the world still has autoscaling on and Config Connector decides to set nodeCount in my spec. My desired state is then lost. This would also make interactions between different attributes of the spec invisible to the operator. It might be obvious in the autoscaling & nodeCount case, but how do I know which other fields interact with each other and which ones can be set by Config Connector?

The comparison to Pods makes a lot of sense; however, in Kubernetes, Pods are the actual world and the ReplicaSet would be the desired state. The ReplicaSet updates its status to reflect the real world, but never its spec (the Pod template):

Replicas:       3 current / 3 desired
Pods Status:    3 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app.kubernetes.io/name=sample-workload
                    pod-template-hash=59b99fd74
  Service Account:  sample-workload
  Containers:
   webapp:
    Image:        dockercloud/hello-world
    Port:         80/TCP
    Host Port:    0/TCP
    Liveness:     http-get http://:80/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get http://:80/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>

Since Config Connector updates the spec based on the real world, that would be the same as kubectl editing the image of a Pod and expecting the ReplicaSet to reflect that new image, which we of course don't want.

Right now in Config Connector, the state of the real world is merged into the spec that holds the desired state. I think these two need to be separated to make Config Connector work well alongside other controllers. Preventing an invalid spec (for instance, setting autoscaling and nodeCount at the same time) is the admission controller's job.
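
For example, a spec like the following is what I'd expect an admission webhook to reject as ambiguous (a sketch):

spec:
  autoscaling:
    minNodeCount: 1
    maxNodeCount: 4
  nodeCount: 2   # conflicts with autoscaling; only one of the two should be allowed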

kibbles-n-bytes commented 4 years ago

(Note that things discussed here are more future-looking. Today we just have the behavior that everything in the API server is the desired state, with initial defaulted fields being preserved permanently.)

> Hey @kibbles-n-bytes, thanks for your reply!

No problem :)

This gets into the weeds a bit with the implementation, but you can be confident we wouldn't blow away your desired state in the API server. Optimistic concurrency via metadata.resourceVersion ensures we'd never race with customer spec updates, and the manager info in metadata.managedFields lets our controller know with confidence which fields it should be considered responsible for versus which ones others are claiming management over. We'd never stomp on values in the spec that we know were explicitly set and managed by others, and we'd only take over management of fields that all other managers never asserted control over or explicitly relinquished.
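
Concretely, the optimistic-concurrency part relies on the standard Kubernetes resourceVersion mechanism (a sketch):

metadata:
  resourceVersion: "123456"   # an update submitted with a stale resourceVersion is rejected with a conflict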

> For instance, if I edit the ContainerNodePool spec and delete autoscaling but set nodeCount, my defined node count will get overwritten right away, because the actual state of the world still has autoscaling on and Config Connector decides to set nodeCount in my spec. My desired state is then lost.

In this case, assuming you were the only non-CC manager, you would be relinquishing control of autoscaling to GCP but asserting control over node count, which would cause CC to fight, as it would enforce your desired node count. Your desired state would never be lost. If you want to switch from autoscaling to manual, you'd want to fully specify autoscaling: false as well as nodeCount: N. That way, CC would ensure autoscaling is never unintentionally flipped back on.

> The comparison to Pods makes a lot of sense; however, in Kubernetes, Pods are the actual world and the ReplicaSet would be the desired state. The ReplicaSet updates its status to reflect the real world, but never its spec

The ReplicaSet->Pod model is a bit different, as it is unapologetically a two-layer approach with a one-to-many, mutable-manager-with-immutable-children relationship. I was more so relating to how Pods can have any number of admission controllers tweaking them after you apply their manifest, and how, if you look at a Pod's spec after it's been persisted, many things (e.g. volume mounts for service account credentials, priority, tolerations) are auto-set. In this case, CC has a one-to-one, mutable-to-mutable relationship with the underlying API's resource, and because of this we want the UX to feel as consistent as possible with a native Kubernetes implementation of that resource.

> Right now in Config Connector, the state of the real world is merged into the spec that holds the desired state. I think these two need to be separated to make Config Connector work well alongside other controllers.

If explicit desired state is always fully preserved (and merged, in the case of multiple appliers), and the fields defaulted back from GCP are only the ones no Kubernetes manager has expressed an opinion on, then you can fully reason about the state of the system. If controllers want specific values for certain fields, all they need to do is specify them in their applied config, and they can be confident we will try to uphold them unless another Kubernetes controller or user changes them in the API server. (We do need to add a status.observedGeneration field to our resources, though, to make things complete.)
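
For what it's worth, a status.observedGeneration field would let appliers tell whether the controller has seen their latest spec, roughly like this (a sketch):

metadata:
  generation: 4            # bumped by the API server on every spec change
status:
  observedGeneration: 4    # set by the controller once it has reconciled that generation
  conditions:
  - type: Ready
    status: "True"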

tonybenchsci commented 4 years ago

XPOST: https://github.com/GoogleCloudPlatform/k8s-config-connector/issues/142#issuecomment-625592830.

I suspect that you have left spec/nodeVersion unpinned on your ContainerClusters but still pin spec/version on your ContainerNodePools.